Yes, of course. Starting to build world now, and then I will test. :)

14.02.2014, 04:41, "Andrew Beekhof" <and...@beekhof.net>:
> The previous patch wasn't quite right.
> Could you try this new one?
>
> http://paste.fedoraproject.org/77123/13923376/
>
> [11:23 AM] beekhof@f19 ~/Development/sources/pacemaker/devel ☺ # git diff
> diff --git a/crmd/callbacks.c b/crmd/callbacks.c
> index ac4b905..d49525b 100644
> --- a/crmd/callbacks.c
> +++ b/crmd/callbacks.c
> @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>              stop_te_timer(down->timer);
>
>              flags |= node_update_join | node_update_expected;
> -            crm_update_peer_join(__FUNCTION__, node, crm_join_none);
> -            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
> +            crmd_peer_down(node, FALSE);
>              check_join_state(fsa_state, __FUNCTION__);
>
>              update_graph(transition_graph, down);
> diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
> index bc472c2..1a2577a 100644
> --- a/crmd/crmd_utils.h
> +++ b/crmd/crmd_utils.h
> @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>  const char *get_timer_desc(fsa_timer_t * timer);
>  gboolean too_many_st_failures(void);
>  void st_fail_count_reset(const char * target);
> +void crmd_peer_down(crm_node_t *peer, bool full);
>
>  # define fsa_register_cib_callback(id, flag, data, fn) do { \
>      fsa_cib_conn->cmds->register_callback( \
> diff --git a/crmd/te_actions.c b/crmd/te_actions.c
> index f31d4ec..3bfce59 100644
> --- a/crmd/te_actions.c
> +++ b/crmd/te_actions.c
> @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char *target, const char *uuid)
>          crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>          peer->uuid = strdup(uuid);
>      }
> -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
> -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
> -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
> -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>
> +    crmd_peer_down(peer, TRUE);
>      node_state =
>          do_update_node_cib(peer,
>                             node_update_cluster | node_update_peer | node_update_join |
> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
> index ad7e573..0c92e95 100644
> --- a/crmd/te_utils.c
> +++ b/crmd/te_utils.c
> @@ -247,10 +247,7 @@ tengine_stonith_notify(stonith_t * st, stonith_event_t * st_event)
>
>          }
>
> -        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
> -        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
> -        crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
> -        crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
> +        crmd_peer_down(peer, TRUE);
>      }
> }
>
> diff --git a/crmd/utils.c b/crmd/utils.c
> index 3988cfe..2df53ab 100644
> --- a/crmd/utils.c
> +++ b/crmd/utils.c
> @@ -1077,3 +1077,13 @@ update_attrd_remote_node_removed(const char *host, const char *user_name)
>      crm_trace("telling attrd to clear attributes for remote host %s", host);
>      update_attrd_helper(host, NULL, NULL, user_name, TRUE, 'C');
>  }
> +
> +void crmd_peer_down(crm_node_t *peer, bool full)
> +{
> +    if(full && peer->state == NULL) {
> +        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
> +        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
> +    }
> +    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
> +    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
> +}
>
> On 16 Jan 2014, at 7:24 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 16.01.2014, 01:30, "Andrew Beekhof" <and...@beekhof.net>:
>>> On 16 Jan 2014, at 12:41 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>> Ok, here's what happens:
>>>>>>>>
>>>>>>>> 1. node2 is lost
>>>>>>>> 2. fencing of node2 starts
>>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>>> 4. node2 returns to the membership
>>>>>>>> 5. node2 is marked as a cluster member
>>>>>>>> 6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>>>    Which is a problem since the node2 fencing operation is part of that
>>>>>>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>>> 8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>>
>>>>>>>> That's in theory, except we automatically try again. Which isn't appropriate.
>>>>>>>> This should be relatively easy to fix.
>>>>>>>>
>>>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>>
>>>>>>>> This I have no idea how to fix yet.
>>>>>>>>
>>>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>>>> The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>>> So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>>> Damn. Looks like you're right. At some point I broke my agent and didn't notice. I'll figure it out.
>>>>>> I repaired my agent - after sending the reboot, it was waiting on STDIN.
>>>>>> The "normal" behavior has returned - it hangs in "pending" until I manually send a reboot. :)
>>>>> Right. Now you're in case 8b.
>>>>>
>>>>> Can you try this patch: http://paste.fedoraproject.org/68450/38973966
>>>> I spent the whole day on experiments.
>>>> Here is what happens:
>>>> 1. Built the cluster.
>>>> 2. On node-2, sent signal -4 - killed corosync.
>>>> 3. From node-1 (the DC), stonith sent a reboot.
>>>> 4. The node rebooted and the resources started.
>>>> 5. Again: on node-2, sent signal -4 - killed corosync.
>>>> 6. Again: from node-1 (the DC), stonith sent a reboot.
>>>> 7. Node-2 rebooted and hangs in "pending".
>>>> 8. Waited, waited..... then rebooted it manually.
>>>> 9. Node-2 rebooted and the resources started.
>>>> 10. GOTO step 2.
>>> Logs?
>> Yesterday I sent a separate message explaining why I had not posted the logs.
>> Please read it; it contains a few more questions.
>> Today it started hanging again, going through the same cycle.
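
(Side note on the agent fix mentioned earlier in this thread: below is a minimal, hypothetical sketch of the reboot-and-wait logic the "sshbykey" agent is described as implementing. It is not a complete stonith agent; the hostname argument, root login, key setup, and timings are all assumptions for illustration.)

    #!/bin/sh
    # Sketch: reboot the victim over ssh, then exit 0 only once it is
    # reachable via ssh again.
    victim="$1"    # assumed: victim hostname supplied by the caller

    # Redirect stdin from /dev/null so ssh cannot hang waiting on STDIN
    # (the exact problem described above).
    ssh -o BatchMode=yes "root@$victim" reboot < /dev/null

    sleep 30       # give the node time to actually go down

    # Poll until the victim answers again; stonith-timeout must be
    # comfortably larger than the worst-case reboot time.
    while ! ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$victim" true < /dev/null
    do
        sleep 5
    done
    exit 0
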
>> Logs here http://send2me.ru/crmrep2.tar.bz2
>>>>>> New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>>>>
>>>>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>>>>>>>
>>>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after a fence, the node hangs in "pending".
>>>>>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>>>>>>>>>>>>>>>>>> After that, the remaining nodes kept rebooting it under various pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>>>>>>>>>>>>>>>>>> Then "Too many failures ...." appeared in the log.
>>>>>>>>>>>>>>>>>> All this time the status in crm_mon was "pending".
>>>>>>>>>>>>>>>>>> Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>>>>>>>>> Much time has passed, and I cannot describe the behavior accurately anymore...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>>>>>> I tried to locate the problem, and this is where I ended up.
>>>>>>>>>>>>>>>>>> I set a large value for the stonith-timeout property ("600s")
>>>>>>>>>>>>>>>>>> and got the following behavior:
>>>>>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>>>>>> 2. The node with the DC calls my fence agent "sshbykey".
>>>>>>>>>>>>>>>>>> 3. It sends a reboot to the victim and waits until it comes back to life.
>>>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back.
>>>>>>>>>>>>>>>> It was 1.1.11 from December 3.
>>>>>>>>>>>>>>>> I will now do a full update and retest.
>>>>>>>>>>>>>>> That should be recent enough. Can you create a crm_report the next time you reproduce?
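
(For anyone following along: the reproduction and report collection described in this thread boil down to roughly the commands below; the node names and time window are illustrative, not taken from the logs.)

    # On node-2: simulate the failure - kill corosync with SIGILL (signal 4)
    pkill -4 corosync

    # On node-1 (the DC): watch node-2 get fenced and then stick in "pending"
    crm_mon -1

    # Once it hangs, gather logs from all nodes for the affected period
    crm_report -f "2014-01-16 00:00:00" -t "2014-01-16 12:00:00" pending-node
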
>>>>>>>>>>>>>> Yes, of course. A small delay.... :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm trying to solve this problem.
>>>>>>>>>>>>> It is not getting solved quickly...
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>>>> g_variant_lookup_value () Since 2.28
>>>>>>>>>>>>>
>>>>>>>>>>>>> # yum list installed glib2
>>>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>>>>>> Installed Packages
>>>>>>>>>>>>> glib2.x86_64        2.26.1-3.el6        installed
>>>>>>>>>>>>>
>>>>>>>>>>>>> # cat /etc/issue
>>>>>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>>>>>> Kernel \r on an \m
>>>>>>>>>>>> Can you try this patch?
>>>>>>>>>>>> Upstart jobs won't work, but the code will compile.
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>>>  static char *
>>>>>>>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>  {
>>>>>>>>>>>> +    char *output = NULL;
>>>>>>>>>>>> +
>>>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>>>> +    static bool err = TRUE;
>>>>>>>>>>>> +
>>>>>>>>>>>> +    if(err) {
>>>>>>>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>>>> +        err = FALSE;
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +#else
>>>>>>>>>>>>      GError *error = NULL;
>>>>>>>>>>>>      GDBusProxy *proxy;
>>>>>>>>>>>>      GVariant *asv = NULL;
>>>>>>>>>>>>      GVariant *value = NULL;
>>>>>>>>>>>>      GVariant *_ret = NULL;
>>>>>>>>>>>> -    char *output = NULL;
>>>>>>>>>>>>
>>>>>>>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>
>>>>>>>>>>>>      g_object_unref(proxy);
>>>>>>>>>>>>      g_variant_unref(_ret);
>>>>>>>>>>>> +#endif
>>>>>>>>>>>>      return output;
>>>>>>>>>>>>  }
>>>>>>>>>>> Ok :) I patched the source.
>>>>>>>>>>> Typed "make rc" - the same error.
>>>>>>>>>> Because it's not building your local changes
>>>>>>>>>>> Made a new copy via "fetch" - the same error.
>>>>>>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it gets created; otherwise the existing archive is used.
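
(That matches the Makefile logic visible in the trimmed log below: the tarball is only regenerated when it is missing. So, presumably, deleting the cached archive forces "make rc" to repackage the patched tree - a sketch of the workaround, not verified here:)

    # Remove the stale release tarball so "make rc" re-runs "git archive"
    # on the local (patched) tree instead of reusing the cached archive.
    rm -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
    make rc
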
>>>>>>>>>>> Trimmed log .......
>>>>>>>>>>>
>>>>>>>>>>> # make rc
>>>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>>>>>>>>>     rm -f pacemaker.tar.*; \
>>>>>>>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>>>>>>>>>         git commit -m "DO-NOT-PUSH" -a; \
>>>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>>         git reset --mixed HEAD^; \
>>>>>>>>>>>     else \
>>>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>>     fi; \
>>>>>>>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>> else \
>>>>>>>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>> fi
>>>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>>>> .......
>>>>>>>>>>>
>>>>>>>>>>> Well, "make rpm" built the rpms and I created a cluster.
>>>>>>>>>>> I ran the same tests and confirmed the behavior.
>>>>>>>>>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>>>> Thanks!
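
(For reference: the upstart build failure earlier in the thread is purely a glib version issue - g_variant_lookup_value() appeared in glib 2.28, while CentOS 6.5 ships 2.26.x. Assuming glib2-devel and pkg-config are installed, a quick way to check what the build will see:)

    # Print the glib version the build will compile against;
    # anything below 2.28 will hit the GLIB_CHECK_VERSION guard above.
    pkg-config --modversion glib-2.0
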
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org