16.01.2014, 01:30, "Andrew Beekhof" <and...@beekhof.net>:
> On 16 Jan 2014, at 12:41 am, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>> Ok, here's what happens:
>>>>>>
>>>>>> 1. node2 is lost
>>>>>> 2. fencing of node2 starts
>>>>>> 3. node2 reboots (and the cluster starts)
>>>>>> 4. node2 returns to the membership
>>>>>> 5. node2 is marked as a cluster member
>>>>>> 6. the DC tries to bring it into the cluster, but needs to cancel the
>>>>>>    active transition first - which is a problem, since the node2
>>>>>>    fencing operation is part of that transition
>>>>>> 7. node2 stays in a transition (pending) state until fencing passes or fails
>>>>>> 8a. fencing fails: the transition completes and the node joins the cluster
>>>>>>
>>>>>> That's the theory, except we automatically try again, which isn't
>>>>>> appropriate. This should be relatively easy to fix.
>>>>>>
>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>
>>>>>> This I have no idea how to fix yet.
>>>>>>
>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>> The node has been back online for a long time and the agent is still
>>>>>> timing out after 10 minutes.
>>>>>> So "Once the script makes sure that the victim will rebooted and
>>>>>> again available via ssh - it exit with 0." does not seem true.
>>>>> Damn, it looks like you're right. At some point I broke my agent and
>>>>> didn't notice. I'll figure it out.
>>>> I repaired my agent - after sending the reboot it was waiting on STDIN.
>>>> The "normal" behaviour is back: the node hangs in "pending" until I
>>>> manually send a reboot. :)
>>> Right. Now you're in case 8b.
>>>
>>> Can you try this patch: http://paste.fedoraproject.org/68450/38973966
>> I spent the whole day on experiments.
>> It turns out the following:
>> 1. Built the cluster.
>> 2. On node-2, sent signal (-4) - killed corosync.
>> 3. From node-1 (where the DC is), stonith sent a reboot.
>> 4. Node-2 rebooted and the resources started.
>> 5. Again: on node-2, sent signal (-4) - killed corosync.
>> 6. Again: from node-1 (where the DC is), stonith sent a reboot.
>> 7. Node-2 rebooted and hangs in "pending".
>> 8. Waited and waited..... rebooted it manually.
>> 9. Node-2 rebooted and the resources started.
>> 10. GOTO step 2.
>
> Logs?

Yesterday I sent a separate mail explaining why I had not attached the logs.
Please read it - it contains a few more questions. Today the node started
hanging again, going through the same cycle. Logs are here:
http://send2me.ru/crmrep2.tar.bz2
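In case it helps while reading the logs: the intended behaviour of the agent is
the one described earlier in the thread (send the reboot, then wait until the
victim answers over ssh again, then exit 0). Below is a minimal sketch of that
logic, not the actual agent - the user, the host argument and the 600-second
budget are placeholders, and stdin is redirected so ssh cannot sit waiting on
it (which was exactly the bug I had introduced):

    #!/bin/sh
    # Sketch only: reboot the victim, then poll until it answers over ssh again.
    VICTIM="$1"

    # Send the reboot; stdin comes from /dev/null so ssh cannot block on it.
    ssh "root@$VICTIM" reboot < /dev/null

    deadline=$(( $(date +%s) + 600 ))
    until ssh -o ConnectTimeout=5 "root@$VICTIM" true < /dev/null > /dev/null 2>&1; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            exit 1    # victim never came back: report the fencing as failed
        fi
        sleep 5
    done
    exit 0            # victim is reachable again: report success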
>>>> New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>>
>>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng:
>>>>>>> ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331]
>>>>>>> (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru'
>>>>>>> with device 'st1' returned: -62 (Timer expired)
>>>>>>>
>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after a fence
>>>>>>>>>>>>>>>>>> the node hangs in "pending".
>>>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>>>>>>>>>> After that the remaining nodes keep rebooting it, under various
>>>>>>>>>>>>>>>> pretexts ("softly whistling", "flying low", "not a cluster member!" ...).
>>>>>>>>>>>>>>>> Then "Too many failures ...." appears in the log.
>>>>>>>>>>>>>>>> All this time the status in crm_mon is "pending", sometimes
>>>>>>>>>>>>>>>> changing to "UNCLEAN".
>>>>>>>>>>>>>>>> Much time has passed and I can no longer describe the behaviour precisely...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>>>> I tried to locate the problem, and ended up here.
>>>>>>>>>>>>>>>> I set a large value for the property stonith-timeout="600s"
>>>>>>>>>>>>>>>> and got the following behaviour:
>>>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>>>> 2. the node with the DC calls my fence agent "sshbykey"
>>>>>>>>>>>>>>>> 3. it sends a reboot to the victim and waits until it comes back to life
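(For reference, one way to raise the stonith-timeout property mentioned above -
a minimal example assuming the crm shell is in use; pcs has an equivalent
"pcs property set stonith-timeout=600s":)

    # Raise the cluster-wide fencing timeout to 600 seconds
    crm configure property stonith-timeout=600s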
>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back.
>>>>>>>>>>>>>> It was a 1.1.11 build from December 3.
>>>>>>>>>>>>>> I'll do a full update now and retest.
>>>>>>>>>>>>> That should be recent enough. Can you create a crm_report the
>>>>>>>>>>>>> next time you reproduce?
>>>>>>>>>>>> Of course, yes. A little delay.... :)
>>>>>>>>>>>>
>>>>>>>>>>>> ......
>>>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to solve this problem.
>>>>>>>>>>> It is not getting solved quickly...
>>>>>>>>>>>
>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>> g_variant_lookup_value () - Since 2.28
>>>>>>>>>>>
>>>>>>>>>>> # yum list installed glib2
>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>>>> Installed Packages
>>>>>>>>>>> glib2.x86_64    2.26.1-3.el6    installed
>>>>>>>>>>>
>>>>>>>>>>> # cat /etc/issue
>>>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>>>> Kernel \r on an \m
>>>>>>>>>> Can you try this patch?
>>>>>>>>>> Upstart jobs won't work, but the code will compile.
>>>>>>>>>>
>>>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>  static char *
>>>>>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>  {
>>>>>>>>>> +    char *output = NULL;
>>>>>>>>>> +
>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>> +    static bool err = TRUE;
>>>>>>>>>> +
>>>>>>>>>> +    if(err) {
>>>>>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>> +        err = FALSE;
>>>>>>>>>> +    }
>>>>>>>>>> +#else
>>>>>>>>>>      GError *error = NULL;
>>>>>>>>>>      GDBusProxy *proxy;
>>>>>>>>>>      GVariant *asv = NULL;
>>>>>>>>>>      GVariant *value = NULL;
>>>>>>>>>>      GVariant *_ret = NULL;
>>>>>>>>>> -    char *output = NULL;
>>>>>>>>>>
>>>>>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>
>>>>>>>>>>      g_object_unref(proxy);
>>>>>>>>>>      g_variant_unref(_ret);
>>>>>>>>>> +#endif
>>>>>>>>>>      return output;
>>>>>>>>>>  }
>>>>>>>>> OK :) I patched the source.
>>>>>>>>> Typed "make rc" - the same error.
>>>>>>>> Because it's not building your local changes
>>>>>>>>> Made a new copy via "fetch" - the same error.
>>>>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does
>>>>>>>>> not exist, it is created; otherwise the existing archive is used.
>>>>>>>>> Trimmed log .......
>>>>>>>>>
>>>>>>>>> # make rc
>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then    \
>>>>>>>>>     rm -f pacemaker.tar.*;    \
>>>>>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then    \
>>>>>>>>>         git commit -m "DO-NOT-PUSH" -a;    \
>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;    \
>>>>>>>>>         git reset --mixed HEAD^;    \
>>>>>>>>>     else    \
>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;    \
>>>>>>>>>     fi;    \
>>>>>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;    \
>>>>>>>>> else    \
>>>>>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;    \
>>>>>>>>> fi
>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>> .......
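(For reference: per the recipe shown above, the source archive is only
regenerated when the cached tarball is absent, and it is produced with
`git archive` from the Pacemaker-1.1.11-rc3 tag, so a local patch also has to
be visible to that archive step. A minimal sketch of forcing a fresh tarball,
assuming the build tree from the log:)

    # Force "make rc" to regenerate the source archive instead of reusing the
    # cached one (the patch itself must also be part of what git archive exports).
    cd /root/ha/pacemaker
    rm -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
    make rc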
>>>>>>>>> Well, "make rpm" built the RPMs and I created the cluster.
>>>>>>>>> I ran the same tests and confirmed the behaviour.
>>>>>>>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>> Thanks!
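(For completeness: the reports linked above were collected with crm_report,
roughly as in the sketch below - the time window and the output name are
placeholders; check crm_report --help for the exact option spelling in your
version.)

    # Gather logs, the CIB and PE inputs for the test window into crmrep.tar.bz2
    crm_report -f "2014-01-13 12:00:00" -t "2014-01-13 14:00:00" crmrep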
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org