On 16 Jan 2014, at 12:41 am, Andrey Groshev <gre...@yandex.ru> wrote:
> 15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> Ok, here's what happens:
>>>>>
>>>>> 1. node2 is lost
>>>>> 2. fencing of node2 starts
>>>>> 3. node2 reboots (and the cluster starts)
>>>>> 4. node2 returns to the membership
>>>>> 5. node2 is marked as a cluster member
>>>>> 6. the DC tries to bring it into the cluster, but needs to cancel the
>>>>>    active transition first, which is a problem since the node2 fencing
>>>>>    operation is part of that transition
>>>>> 7. node2 stays in a transition (pending) state until fencing passes or fails
>>>>> 8a. fencing fails: the transition completes and the node joins the cluster
>>>>>
>>>>> That's the theory, except we automatically try again, which isn't
>>>>> appropriate. This should be relatively easy to fix.
>>>>>
>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>
>>>>> This I have no idea how to fix yet.
>>>>>
>>>>> On another note, it doesn't look like this agent works at all.
>>>>> The node has been back online for a long time and the agent is still
>>>>> timing out after 10 minutes.
>>>>> So "Once the script makes sure that the victim has rebooted and is
>>>>> available again via ssh, it exits with 0" does not seem to be true.
>>>> Damn. Looks like you're right. At some point I broke my agent and hadn't
>>>> noticed. I'll figure out what happened.
>>> I repaired my agent - after sending the reboot it was waiting on STDIN.
>>> The "normal" behaviour is back - the node hangs in "pending" until I
>>> manually send a reboot. :)
>>
>> Right. Now you're in case 8b.
>>
>> Can you try this patch: http://paste.fedoraproject.org/68450/38973966
>
> I spent the whole day on experiments. It turns out that:
> 1. Built the cluster.
> 2. On node-2, sent signal 4, killing corosync.
> 3. From node-1 (the DC), stonith sent a reboot.
> 4. Node-2 rebooted and its resources started.
> 5. Again: on node-2, sent signal 4, killing corosync.
> 6. Again: from node-1 (the DC), stonith sent a reboot.
> 7. Node-2 rebooted and hung in "pending".
> 8. Waited and waited... then rebooted it manually.
> 9. Node-2 rebooted and its resources started.
> 10. GOTO step 2.

Logs?

>>> New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>
>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng:
>>>>>> ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331]
>>>>>> (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru'
>>>>>> with device 'st1' returned: -62 (Timer expired)
>>>>>>
>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after the
>>>>>>>>>>>>>>>>> fence the node hangs in "pending".
>>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>>>>>>>>>>>>>>> After that, the remaining nodes constantly reboot it,
>>>>>>>>>>>>>>> under various pretexts: "softly whistling", "fly low",
>>>>>>>>>>>>>>> "not a cluster member!"...
>>>>>>>>>>>>>>> Then "Too many failures...." fell out in the log.
>>>>>>>>>>>>>>> All this time the status in crm_mon was "pending".
>>>>>>>>>>>>>>> Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>>>>>> Much time has passed and I cannot accurately describe
>>>>>>>>>>>>>>> the behaviour...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>>> I tried to locate the problem and came here with this.
>>>>>>>>>>>>>>> I set a big value in the property stonith-timeout="600s"
>>>>>>>>>>>>>>> and got the following behaviour:
>>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>>> 2. the node with the DC calls my fence agent "sshbykey"
>>>>>>>>>>>>>>> 3. it sends the victim a reboot and waits until she comes
>>>>>>>>>>>>>>>    back to life
>>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>>> It was a 1.1.11 build from December 3.
>>>>>>>>>>>>> I'm now trying a full update and retest.
>>>>>>>>>>>> That should be recent enough. Can you create a crm_report the
>>>>>>>>>>>> next time you reproduce?
>>>>>>>>>>> Of course yes. A little delay.... :)
>>>>>>>>>>>
>>>>>>>>>>> ......
>>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to solve this problem.
>>>>>>>>>> It did not get solved quickly...
>>>>>>>>>>
>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>> g_variant_lookup_value () Since 2.28
>>>>>>>>>>
>>>>>>>>>> # yum list installed glib2
>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>>> Installed Packages
>>>>>>>>>> glib2.x86_64    2.26.1-3.el6    installed
>>>>>>>>>>
>>>>>>>>>> # cat /etc/issue
>>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>>> Kernel \r on an \m
>>>>>>>>> Can you try this patch?
>>>>>>>>> Upstart jobs won't work, but the code will compile
>>>>>>>>>
>>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>  static char *
>>>>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>  {
>>>>>>>>> +    char *output = NULL;
>>>>>>>>> +
>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>> +    static bool err = TRUE;
>>>>>>>>> +
>>>>>>>>> +    if(err) {
>>>>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>> +        err = FALSE;
>>>>>>>>> +    }
>>>>>>>>> +#else
>>>>>>>>>      GError *error = NULL;
>>>>>>>>>      GDBusProxy *proxy;
>>>>>>>>>      GVariant *asv = NULL;
>>>>>>>>>      GVariant *value = NULL;
>>>>>>>>>      GVariant *_ret = NULL;
>>>>>>>>> -    char *output = NULL;
>>>>>>>>>
>>>>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>
>>>>>>>>>      g_object_unref(proxy);
>>>>>>>>>      g_variant_unref(_ret);
>>>>>>>>> +#endif
>>>>>>>>>      return output;
>>>>>>>>>  }
>>>>>>>> Ok :) I patched the source.
>>>>>>>> Typed "make rc" - the same error.
>>>>>>> Because it's not building your local changes
>>>>>>>> Made a new copy via "fetch" - the same error.
>>>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>> does not exist, it downloads it; otherwise it uses the existing archive.
>>>>>>>> Cut-down log:
>>>>>>>>
>>>>>>>> # make rc
>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>>>>>>     rm -f pacemaker.tar.*; \
>>>>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>>>>>>         git commit -m "DO-NOT-PUSH" -a; \
>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>         git reset --mixed HEAD^; \
>>>>>>>>     else \
>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>     fi; \
>>>>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>> else \
>>>>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>> fi
>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball:
>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>> .......
>>>>>>>>
>>>>>>>> Well, "make rpm" built the rpms and I created a cluster.
>>>>>>>> I ran the same tests and confirmed the behaviour.
>>>>>>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>> Thanks!
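[For readers following the thread: the behaviour the fence agent is supposed to have - "send the reboot, then wait until the victim answers ssh again, then exit 0" - can be sketched roughly as below. This is a minimal illustration, not the actual "sshbykey" agent; the function name, probe command, and intervals are assumptions.]

```shell
#!/bin/sh
# Illustrative sketch only - not the actual "sshbykey" agent from this thread.
# After a fence agent issues the reboot, it should poll until the victim
# answers ssh again, and only then exit 0 so stonith-ng can report success.

wait_for_ssh() {
    victim=$1
    timeout=${2:-600}     # seconds; mirrors stonith-timeout="600s" from the thread
    # The probe is overridable so the loop can be tested without a real host.
    probe=${3:-"ssh -o BatchMode=yes -o ConnectTimeout=5 $victim true"}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if sh -c "$probe" >/dev/null 2>&1; then
            return 0      # victim is reachable again: safe to exit 0
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 1              # still unreachable: the fencing operation failed
}
```

Two practical notes from the thread itself: the agent must not block reading STDIN after issuing the reboot (the bug Andrey found), and stonith-timeout has to comfortably exceed the victim's worst-case reboot time, or stonith-ng reports -62 (Timer expired) as in the log above.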
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org