Ok, here's what happens:

1. node2 is lost
2. fencing of node2 starts
3. node2 reboots (and cluster starts)
4. node2 returns to the membership
5. node2 is marked as a cluster member
6. the DC tries to bring it into the cluster, but needs to cancel the active
transition first, which is a problem since the node2 fencing operation is
part of that transition
7. node2 stays in a transition (pending) state until fencing passes or fails
8a. fencing fails: transition completes and the node joins the cluster

That's the theory, except we automatically retry the fencing, which isn't
appropriate here. This should be relatively easy to fix.

8b. fencing passes: the node is incorrectly marked as offline

This I have no idea how to fix yet.
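The 8a/8b branches above can be sketched as a tiny decision function. This is purely an illustration of the sequence described, not Pacemaker's actual code; the function name is made up:

```shell
# Hypothetical sketch of the 8a/8b branches (illustration only, not
# Pacemaker's logic): the rejoined node stays "pending" until the
# stale fencing operation resolves.
rejoin_outcome() {
    case "$1" in                           # $1: result of the stale fencing op
        failed) echo "online" ;;           # 8a: transition completes, node joins
        passed) echo "offline (incorrect)" ;;  # 8b: healthy node marked offline
        *)      echo "pending" ;;          # still waiting on the fencing op
    esac
}
```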


On another note, it doesn't look like this agent works at all.
The node has been back online for a long time and the agent is still timing out 
after 10 minutes.
So "Once the script makes sure that the victim will rebooted and again 
available via ssh - it exit with 0." does not seem true.
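For what it's worth, the behaviour the agent documents, if actually implemented, would look roughly like the sketch below: poll the victim and give up before the stonith timeout instead of hanging for 10 minutes. This is an assumption about how such an agent should behave, not its actual code; the probe is a parameter here for testability, and a real agent would pass an ssh check such as `ssh -o BatchMode=yes -o ConnectTimeout=5 "$victim" true`:

```shell
# Sketch only: poll a probe command until it succeeds or a deadline
# (in seconds) expires. A real fence agent would use an ssh probe and
# a deadline comfortably below stonith-timeout.
wait_until_up() {
    probe=$1
    deadline=$2
    start=$(date +%s)
    while ! $probe; do
        if [ $(( $(date +%s) - start )) -ge "$deadline" ]; then
            return 1    # victim never came back; report failure
        fi
        sleep 1
    done
    return 0            # victim reachable again; agent may exit 0
}
```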

On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:

> Apart from anything else, your timeout needs to be bigger:
> 
> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
> commands.c:1321  )   error: log_operation:   Operation 'reboot' [11331] (call 
> 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
> 'st1' returned: -62 (Timer expired)
> 
> 
> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
> 
>> 
>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>> 
>>> 
>>> 
>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 
>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>  10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>   On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>    08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>    On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>     Hi, ALL.
>>>>>>>>>>>> 
>>>>>>>>>>>>     I'm still trying to cope with the fact that after the fence
>>>>>>>>>>>> the node hangs in "pending".
>>>>>>>>>>>    Please define "pending".  Where did you see this?
>>>>>>>>>>    In crm_mon:
>>>>>>>>>>    ......
>>>>>>>>>>    Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>    ......
>>>>>>>>>> 
>>>>>>>>>>    The experiment was like this:
>>>>>>>>>>    Four nodes in cluster.
>>>>>>>>>>    On one of them kill corosync or pacemakerd (signal 4, 6, or 11).
>>>>>>>>>>    Thereafter, the remaining nodes constantly reboot it, under
>>>>>>>>>> various pretexts: "softly whistling", "fly low", "not a cluster
>>>>>>>>>> member!" ...
>>>>>>>>>>    Then in the log fell out "Too many failures ...."
>>>>>>>>>>    All this time in the status in crm_mon is "pending".
>>>>>>>>>>    Depending on the wind direction changed to "UNCLEAN"
>>>>>>>>>>    Much time has passed and I can not accurately describe the 
>>>>>>>>>> behavior...
>>>>>>>>>> 
>>>>>>>>>>    Now I am in the following state:
>>>>>>>>>>    I tried to locate the problem and came up with this.
>>>>>>>>>>    I set big value in property stonith-timeout="600s".
>>>>>>>>>>    And got the following behavior:
>>>>>>>>>>    1. pkill -4 corosync
>>>>>>>>>>    2. from node with DC call my fence agent "sshbykey"
>>>>>>>>>>    3. It sends a reboot to the victim and waits until it comes back to life.
>>>>>>>>>   Hmmm.... what version of pacemaker?
>>>>>>>>>   This sounds like a timing issue that we fixed a while back
>>>>>>>>  Was a version 1.1.11 from December 3.
>>>>>>>>  Now I'll try a full update and retest.
>>>>>>> That should be recent enough.  Can you create a crm_report the next 
>>>>>>> time you reproduce?
>>>>>> Of course yes. Little delay.... :)
>>>>>> 
>>>>>> ......
>>>>>> cc1: warnings being treated as errors
>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>> upstart.c:264: error: implicit declaration of function 
>>>>>> ‘g_variant_lookup_value’
>>>>>> upstart.c:264: error: nested extern declaration of 
>>>>>> ‘g_variant_lookup_value’
>>>>>> upstart.c:264: error: assignment makes pointer from integer without a 
>>>>>> cast
>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>> make: *** [core] Error 1
>>>>>> 
>>>>>> I'm trying to solve this problem.
>>>>> It won't get solved quickly...
>>>>> 
>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>> g_variant_lookup_value () Since 2.28
>>>>> 
>>>>> # yum list installed glib2
>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>> Loading mirror speeds from cached hostfile
>>>>> Installed Packages
>>>>> glib2.x86_64                                                              
>>>>> 2.26.1-3.el6                                                              
>>>>>  installed
>>>>> 
>>>>> # cat /etc/issue
>>>>> CentOS release 6.5 (Final)
>>>>> Kernel \r on an \m
>>>> 
>>>> Can you try this patch?
>>>> Upstart jobs won't work, but the code will compile
>>>> 
>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>> index 831e7cf..195c3a4 100644
>>>> --- a/lib/services/upstart.c
>>>> +++ b/lib/services/upstart.c
>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>> static char *
>>>> upstart_job_property(const char *obj, const gchar * iface, const char 
>>>> *name)
>>>> {
>>>> +    char *output = NULL;
>>>> +
>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>> +    static bool err = TRUE;
>>>> +
>>>> +    if(err) {
>>>> +        crm_err("This version of glib is too old to support upstart 
>>>> jobs");
>>>> +        err = FALSE;
>>>> +    }
>>>> +#else
>>>>    GError *error = NULL;
>>>>    GDBusProxy *proxy;
>>>>    GVariant *asv = NULL;
>>>>    GVariant *value = NULL;
>>>>    GVariant *_ret = NULL;
>>>> -    char *output = NULL;
>>>> 
>>>>    crm_info("Calling GetAll on %s", obj);
>>>>    proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * 
>>>> iface, const char *name)
>>>> 
>>>>    g_object_unref(proxy);
>>>>    g_variant_unref(_ret);
>>>> +#endif
>>>>    return output;
>>>> }
>>>> 
>>> 
>>> OK :) I patched the source.
>>> Typed "make rc" - the same error.
>> 
>> Because it's not building your local changes
>> 
>>> Made a new copy via "fetch" - the same error.
>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does
>>> not exist, it is downloaded; otherwise the existing archive is used.
>>> Trimmed log .......
>>> 
>>> # make rc
>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>> make[1]: Entering directory `/root/ha/pacemaker'
>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>     rm -f pacemaker.tar.*; \
>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>         git commit -m "DO-NOT-PUSH" -a; \
>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>         git reset --mixed HEAD^; \
>>>     else \
>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>     fi; \
>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>> else \
>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>> fi
>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>> .......
>>> 
>>> Well, "make rpm" - build rpms and I create cluster.
>>> I spent the same tests and confirmed the behavior.
>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>> 
>> Thanks!
> 
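The `make rc` output quoted above shows why local patches were being ignored: the archive is regenerated only when the tarball is absent. A minimal stand-in for that caching behaviour (`make_rc` here is a placeholder function for illustration, not the real Makefile rule):

```shell
# Stand-in for the Makefile step seen in the log: rebuild only when
# the tarball is missing, otherwise reuse it (masking local changes).
TARBALL=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz

make_rc() {
    if [ ! -f "$TARBALL" ]; then
        : > "$TARBALL"                 # placeholder for "git archive | gzip"
        echo "Rebuilt $TARBALL"
    else
        echo "Using existing tarball: $TARBALL"
    fi
}
```

So deleting the cached tarball (`rm -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz`) before rerunning `make rc` should force a fresh archive of the patched tree.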


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
