16.01.2014, 01:30, "Andrew Beekhof" <and...@beekhof.net>:
> On 16 Jan 2014, at 12:41 am, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>> Ok, here's what happens:
>>>>>>
>>>>>> 1. node2 is lost
>>>>>> 2. fencing of node2 starts
>>>>>> 3. node2 reboots (and the cluster starts)
>>>>>> 4. node2 returns to the membership
>>>>>> 5. node2 is marked as a cluster member
>>>>>> 6. the DC tries to bring it into the cluster, but needs to cancel the
>>>>>>    active transition first - which is a problem, since the node2
>>>>>>    fencing operation is part of that transition
>>>>>> 7. node2 stays in a transition (pending) state until fencing passes or fails
>>>>>> 8a. fencing fails: the transition completes and the node joins the cluster
>>>>>>
>>>>>> That's the theory, except we automatically try again, which isn't
>>>>>> appropriate. This should be relatively easy to fix.
>>>>>>
>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>
>>>>>> This I have no idea how to fix yet.
>>>>>>
>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>> The node has been back online for a long time and the agent is still
>>>>>> timing out after 10 minutes.
>>>>>> So "Once the script makes sure that the victim will rebooted and
>>>>>> again available via ssh - it exit with 0." does not seem true.
>>>>> Damn, it looks like you're right. At some point I broke my agent and
>>>>> didn't notice. I'll figure it out.
>>>> I repaired my agent - after sending the reboot it was waiting on STDIN.
>>>> The "normal" behaviour is back: the node hangs in "pending" until I
>>>> manually send a reboot. :)
>>> Right. Now you're in case 8b.
>>>
>>> Can you try this patch: http://paste.fedoraproject.org/68450/38973966
>> I spent the whole day on experiments.
>> It turns out the following:
>> 1. Built the cluster.
>> 2. On node-2, sent signal (-4) - killed corosync.
>> 3. From node-1 (where the DC is), stonith sent a reboot.
>> 4. Node-2 rebooted and the resources started.
>> 5. Again: on node-2, sent signal (-4) - killed corosync.
>> 6. Again: from node-1 (where the DC is), stonith sent a reboot.
>> 7. Node-2 rebooted and hangs in "pending".
>> 8. Waited and waited..... rebooted it manually.
>> 9. Node-2 rebooted and the resources started.
>> 10. GOTO step 2.
>
> Logs?

Yesterday I sent a separate mail explaining why I had not attached the logs.
Please read it - it contains a few more questions. Today the node started
hanging again, going through the same cycle. Logs are here:
http://send2me.ru/crmrep2.tar.bz2
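In case it helps while reading the logs: the intended behaviour of the agent is
the one described earlier in the thread (send the reboot, then wait until the
victim answers over ssh again, then exit 0). Below is a minimal sketch of that
logic, not the actual agent - the user, the host argument and the 600-second
budget are placeholders, and stdin is redirected so ssh cannot sit waiting on
it (which was exactly the bug I had introduced):

    #!/bin/sh
    # Sketch only: reboot the victim, then poll until it answers over ssh again.
    VICTIM="$1"

    # Send the reboot; stdin comes from /dev/null so ssh cannot block on it.
    ssh "root@$VICTIM" reboot < /dev/null

    deadline=$(( $(date +%s) + 600 ))
    until ssh -o ConnectTimeout=5 "root@$VICTIM" true < /dev/null > /dev/null 2>&1; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            exit 1    # victim never came back: report the fencing as failed
        fi
        sleep 5
    done
    exit 0            # victim is reachable again: report success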
>>>> New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>>
>>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng:
>>>>>>> ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331]
>>>>>>> (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru'
>>>>>>> with device 'st1' returned: -62 (Timer expired)
>>>>>>>
>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after a fence
>>>>>>>>>>>>>>>>>> the node hangs in "pending".
>>>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>>>>>>>>>> After that the remaining nodes keep rebooting it, under various
>>>>>>>>>>>>>>>> pretexts ("softly whistling", "flying low", "not a cluster member!" ...).
>>>>>>>>>>>>>>>> Then "Too many failures ...." appears in the log.
>>>>>>>>>>>>>>>> All this time the status in crm_mon is "pending", sometimes
>>>>>>>>>>>>>>>> changing to "UNCLEAN".
>>>>>>>>>>>>>>>> Much time has passed and I can no longer describe the behaviour precisely...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>>>> I tried to locate the problem, and ended up here.
>>>>>>>>>>>>>>>> I set a large value for the property stonith-timeout="600s"
>>>>>>>>>>>>>>>> and got the following behaviour:
>>>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>>>> 2. the node with the DC calls my fence agent "sshbykey"
>>>>>>>>>>>>>>>> 3. it sends a reboot to the victim and waits until it comes back to life
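(For reference, one way to raise the stonith-timeout property mentioned above -
a minimal example assuming the crm shell is in use; pcs has an equivalent
"pcs property set stonith-timeout=600s":)

    # Raise the cluster-wide fencing timeout to 600 seconds
    crm configure property stonith-timeout=600s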
>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back.
>>>>>>>>>>>>>> It was a 1.1.11 build from December 3.
>>>>>>>>>>>>>> I'll do a full update now and retest.
>>>>>>>>>>>>> That should be recent enough. Can you create a crm_report the
>>>>>>>>>>>>> next time you reproduce?
>>>>>>>>>>>> Of course, yes. A little delay.... :)
>>>>>>>>>>>>
>>>>>>>>>>>> ......
>>>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to solve this problem.
>>>>>>>>>>> It is not getting solved quickly...
>>>>>>>>>>>
>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>> g_variant_lookup_value () - Since 2.28
>>>>>>>>>>>
>>>>>>>>>>> # yum list installed glib2
>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>>>> Installed Packages
>>>>>>>>>>> glib2.x86_64    2.26.1-3.el6    installed
>>>>>>>>>>>
>>>>>>>>>>> # cat /etc/issue
>>>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>>>> Kernel \r on an \m
>>>>>>>>>> Can you try this patch?
>>>>>>>>>> Upstart jobs won't work, but the code will compile.
>>>>>>>>>>
>>>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>  static char *
>>>>>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>  {
>>>>>>>>>> +    char *output = NULL;
>>>>>>>>>> +
>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>> +    static bool err = TRUE;
>>>>>>>>>> +
>>>>>>>>>> +    if(err) {
>>>>>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>> +        err = FALSE;
>>>>>>>>>> +    }
>>>>>>>>>> +#else
>>>>>>>>>>      GError *error = NULL;
>>>>>>>>>>      GDBusProxy *proxy;
>>>>>>>>>>      GVariant *asv = NULL;
>>>>>>>>>>      GVariant *value = NULL;
>>>>>>>>>>      GVariant *_ret = NULL;
>>>>>>>>>> -    char *output = NULL;
>>>>>>>>>>
>>>>>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>
>>>>>>>>>>      g_object_unref(proxy);
>>>>>>>>>>      g_variant_unref(_ret);
>>>>>>>>>> +#endif
>>>>>>>>>>      return output;
>>>>>>>>>>  }
>>>>>>>>> OK :) I patched the source.
>>>>>>>>> Typed "make rc" - the same error.
>>>>>>>> Because it's not building your local changes
>>>>>>>>> Made a new copy via "fetch" - the same error.
>>>>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does
>>>>>>>>> not exist, it is created; otherwise the existing archive is used.
>>>>>>>>> Trimmed log .......
>>>>>>>>>
>>>>>>>>> # make rc
>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then    \
>>>>>>>>>     rm -f pacemaker.tar.*;    \
>>>>>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then    \
>>>>>>>>>         git commit -m "DO-NOT-PUSH" -a;    \
>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;    \
>>>>>>>>>         git reset --mixed HEAD^;    \
>>>>>>>>>     else    \
>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;    \
>>>>>>>>>     fi;    \
>>>>>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;    \
>>>>>>>>> else    \
>>>>>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;    \
>>>>>>>>> fi
>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>> .......
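(For reference: per the recipe shown above, the source archive is only
regenerated when the cached tarball is absent, and it is produced with
`git archive` from the Pacemaker-1.1.11-rc3 tag, so a local patch also has to
be visible to that archive step. A minimal sketch of forcing a fresh tarball,
assuming the build tree from the log:)

    # Force "make rc" to regenerate the source archive instead of reusing the
    # cached one (the patch itself must also be part of what git archive exports).
    cd /root/ha/pacemaker
    rm -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
    make rc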
>>>>>>>>> Well, "make rpm" built the RPMs and I created the cluster.
>>>>>>>>> I ran the same tests and confirmed the behaviour.
>>>>>>>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>> Thanks!
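(For completeness: the reports linked above were collected with crm_report,
roughly as in the sketch below - the time window and the output name are
placeholders; check crm_report --help for the exact option spelling in your
version.)

    # Gather logs, the CIB and PE inputs for the test window into crmrep.tar.bz2
    crm_report -f "2014-01-13 12:00:00" -t "2014-01-13 14:00:00" crmrep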
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org