20.02.2014, 13:57, "Andrew Beekhof" <and...@beekhof.net>:
> On 20 Feb 2014, at 5:33 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 20.02.2014, 01:22, "Andrew Beekhof" <and...@beekhof.net>:
>>> On 20 Feb 2014, at 4:18 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 19.02.2014, 06:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> On 18 Feb 2014, at 9:29 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> Hi, ALL and Andrew!
>>>>>>
>>>>>> Today is a good day - I killed a lot, and a lot of shooting came back at me.
>>>>>> In general - I am happy (almost like an elephant) :)
>>>>>> Apart from the resources, eight processes on the node matter to me:
>>>>>> corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>> I killed them with different signals (4, 6, 11 and even 9).
>>>>>> The behavior does not depend on the signal number - that is good.
>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the
>>>>>> cluster - that is also good.
>>>>>> But the behavior differs depending on which daemon is killed.
>>>>>>
>>>>>> They fall into four groups:
>>>>>> 1. corosync, cib - STONITH works 100%.
>>>>>>    Killed with any signal - STONITH is called and the node reboots.
>>>>> excellent
>>>>>> 3. stonithd, attrd, pengine - STONITH is not needed.
>>>>>>    These daemons simply restart; resources stay running.
>>>>> right
>>>>>> 2. lrmd, crmd - strange STONITH behavior.
>>>>>>    Sometimes STONITH is called - with the corresponding reaction.
>>>>>>    Sometimes the daemon just restarts
>>>>> The daemon will always try to restart, the only variable is how long it
>>>>> takes the peer to notice and initiate fencing.
>>>>> If the failure happens just before they're due to receive a totem
>>>>> token, the failure will be very quickly detected and the node fenced.
>>>>> If the failure happens just after, then detection will take longer -
>>>>> giving the node longer to recover and not be fenced.
>>>>>
>>>>> So fence/not fence is normal and to be expected.
>>>>>> and the MS:pgsql resource restarts with a large delay.
>>>>>> One time, after the crmd restarted, pgsql did not restart.
>>>>> I would not expect pgsql to ever restart - if the RA does its job
>>>>> properly anyway.
>>>>> In the case the node is not fenced, the crmd will respawn and the
>>>>> PE will request that it re-detect the state of all resources.
>>>>>
>>>>> If the agent reports "all good", then there is nothing more to do.
>>>>> If the agent is not reporting "all good", you should really be asking why.
>>>>>> 4. pacemakerd - nothing happens.
>>>>> On non-systemd based machines, correct.
>>>>>
>>>>> On a systemd based machine pacemakerd is respawned and reattaches to
>>>>> the existing daemons.
>>>>> Any subsequent daemon failure will be detected and the daemon respawned.
>>>> And! I almost forgot about IT!
>>>> Are there other (NORMAL) variants, methods, ideas?
>>>> Without this ... @$%#$%&$%^&$%^&##@#$$^$%& !!!!!
>>>> Otherwise - it's a full epic fail ;)
>>> -ENOPARSE
>> OK, I put aside my personal attitude to "systemd".
>> Let me explain.
>>
>> Somewhere in the beginning of this topic, I wrote:
>> A.G.: Who knows who runs lrmd?
>> A.B.: Pacemakerd.
>> That's one!
>>
>> Let's see the list of processes:
>> # ps -axf
>> .....
>>  6067 ?  Ssl    7:24 corosync
>>  6092 ?  S      0:25 pacemakerd
>>  6094 ?  Ss   116:13  \_ /usr/libexec/pacemaker/cib
>>  6095 ?  Ss     0:25  \_ /usr/libexec/pacemaker/stonithd
>>  6096 ?  Ss     1:27  \_ /usr/libexec/pacemaker/lrmd
>>  6097 ?  Ss     0:49  \_ /usr/libexec/pacemaker/attrd
>>  6098 ?  Ss     0:25  \_ /usr/libexec/pacemaker/pengine
>>  6099 ?  Ss     0:29  \_ /usr/libexec/pacemaker/crmd
>> .....
>> That's two!
>
> Whats two? I don't follow.

In the sense that pacemakerd creates the other processes. But it does not matter.
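As context for the four behavior groups listed earlier, here is a minimal sketch of the kill test being described. It assumes it is run as root on one cluster node and that the daemon name matches one of the /usr/libexec/pacemaker/* children in the ps output above; the script is only an illustration, not part of Pacemaker:

#!/bin/sh
# Kill one pacemaker daemon and watch whether pacemakerd respawns it locally,
# or whether nothing comes back (in which case the peers should fence this node).
DAEMON=${1:-crmd}        # e.g. cib, stonithd, lrmd, attrd, pengine, crmd
SIGNAL=${2:-11}          # 4, 6, 11 or 9, as in the tests above

OLD_PID=$(pgrep -x "$DAEMON") || { echo "$DAEMON is not running"; exit 1; }
echo "Sending signal $SIGNAL to $DAEMON (pid $OLD_PID)"
kill -"$SIGNAL" "$OLD_PID"

i=0
while [ $i -lt 30 ]; do
    sleep 2
    NEW_PID=$(pgrep -x "$DAEMON")
    if [ -n "$NEW_PID" ] && [ "$NEW_PID" != "$OLD_PID" ]; then
        echo "$DAEMON respawned as pid $NEW_PID"
        exit 0
    fi
    i=$((i + 1))
done
echo "$DAEMON did not come back locally; expect this node to be fenced"

For the corosync and cib cases described above the expected outcome is fencing rather than a local respawn, so the loop would simply time out.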
>> And more, more...
>> Now you must understand why I want this process to always keep working.
>> I think no one here needs that explained!
>>
>> And now you say "pacemakerd works nicely, but only on systemd distros"!!!
>
> No, I'm saying it works _better_ on systemd distros.
> On non-systemd distros you still need quite a few unlikely-to-happen failures
> to trigger a situation in which the node still gets fenced and recovered
> (assuming no-one saw any of the error messages and didn't run "service
> pacemaker restart" prior to the additional failures).

Can you show me where this happens: "On a systemd based machine pacemakerd is
respawned and reattaches to the existing daemons"?
If I respawn the pacemakerd process via upstart, does it "reattach to the
existing daemons"?

>> What should I do now?
>> * Integrate systemd into CentOS?
>> * Migrate to Fedora?
>> * Buy RHEL7!?
>
> Option 3 is particularly good :)

It's too easy. Normal heroes always take the long way round :)

>> Each of these variants is great, but none of them fits me.
>>
>> P.S. And I'm not talking about the distros which have not migrated to
>> systemd (and will not).
>
> Are there any? Even debian and ubuntu have raised the white flag.

Of course that is a lyrical digression, but potentially it can be any Unix-like system.

>> Do not be offended! We do the same.
>> We build a secret military factory,
>> a large concrete fence around it,
>> walls of barbed wire - but forget to install the gates. :)
>>>>>> And then I can kill any process of the third group. They do not
>>>>>> restart.
>>>>> Until they become needed.
>>>>> Eg. if the DC goes to invoke the policy engine, that will fail causing
>>>>> the crmd to fail and the node to be fenced.
>>>>>> Generally, don't touch corosync, cib and maybe lrmd, crmd.
>>>>>>
>>>>>> What do you think about this?
>>>>>> The main question of this topic we have solved.
>>>>>> But this varied behavior is another big problem.
>>>>>>
>>>>>> 17.02.2014, 08:52, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>> 17.02.2014, 02:27, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>> With no quick follow-up, dare one hope that means the patch
>>>>>>>> worked? :-)
>>>>>>> Hi,
>>>>>>> No, unfortunately the chief changed my plans on Friday and all day I
>>>>>>> was engaged in a parallel project.
>>>>>>> I hope that today I have time to carry out the necessary tests.
>>>>>>>> On 14 Feb 2014, at 3:37 pm, Andrey Groshev <gre...@yandex.ru>
>>>>>>>> wrote:
>>>>>>>>> Yes, of course. Now starting to build the world and test )
>>>>>>>>>
>>>>>>>>> 14.02.2014, 04:41, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>> The previous patch wasn't quite right.
>>>>>>>>>> Could you try this new one?
>>>>>>>>>>
>>>>>>>>>> http://paste.fedoraproject.org/77123/13923376/
>>>>>>>>>>
>>>>>>>>>> [11:23 AM] beekhof@f19 ~/Development/sources/pacemaker/devel ☺
>>>>>>>>>> # git diff
>>>>>>>>>> diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>>>>>>>> index ac4b905..d49525b 100644
>>>>>>>>>> --- a/crmd/callbacks.c
>>>>>>>>>> +++ b/crmd/callbacks.c
>>>>>>>>>> @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>>>>>>>>>              stop_te_timer(down->timer);
>>>>>>>>>>
>>>>>>>>>>              flags |= node_update_join | node_update_expected;
>>>>>>>>>> -            crm_update_peer_join(__FUNCTION__, node, crm_join_none);
>>>>>>>>>> -            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>>>>>>>>> +            crmd_peer_down(node, FALSE);
>>>>>>>>>>              check_join_state(fsa_state, __FUNCTION__);
>>>>>>>>>>
>>>>>>>>>>              update_graph(transition_graph, down);
>>>>>>>>>> diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
>>>>>>>>>> index bc472c2..1a2577a 100644
>>>>>>>>>> --- a/crmd/crmd_utils.h
>>>>>>>>>> +++ b/crmd/crmd_utils.h
>>>>>>>>>> @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>>>>>>>>>>  const char *get_timer_desc(fsa_timer_t * timer);
>>>>>>>>>>  gboolean too_many_st_failures(void);
>>>>>>>>>>  void st_fail_count_reset(const char * target);
>>>>>>>>>> +void crmd_peer_down(crm_node_t *peer, bool full);
>>>>>>>>>>
>>>>>>>>>>  #  define fsa_register_cib_callback(id, flag, data, fn) do {      \
>>>>>>>>>>          fsa_cib_conn->cmds->register_callback(                    \
>>>>>>>>>> diff --git a/crmd/te_actions.c b/crmd/te_actions.c
>>>>>>>>>> index f31d4ec..3bfce59 100644
>>>>>>>>>> --- a/crmd/te_actions.c
>>>>>>>>>> +++ b/crmd/te_actions.c
>>>>>>>>>> @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char *target, const char *uuid)
>>>>>>>>>>          crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>>>>>>>>>>          peer->uuid = strdup(uuid);
>>>>>>>>>>      }
>>>>>>>>>> -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>> -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>> -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>> -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>
>>>>>>>>>> +    crmd_peer_down(peer, TRUE);
>>>>>>>>>>      node_state =
>>>>>>>>>>          do_update_node_cib(peer,
>>>>>>>>>>                             node_update_cluster | node_update_peer | node_update_join |
>>>>>>>>>> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
>>>>>>>>>> index ad7e573..0c92e95 100644
>>>>>>>>>> --- a/crmd/te_utils.c
>>>>>>>>>> +++ b/crmd/te_utils.c
>>>>>>>>>> @@ -247,10 +247,7 @@ tengine_stonith_notify(stonith_t * st, stonith_event_t * st_event)
>>>>>>>>>>
>>>>>>>>>>          }
>>>>>>>>>>
>>>>>>>>>> -        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>> -        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>> -        crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>> -        crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>> +        crmd_peer_down(peer, TRUE);
>>>>>>>>>>      }
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> diff --git a/crmd/utils.c b/crmd/utils.c
>>>>>>>>>> index 3988cfe..2df53ab 100644
>>>>>>>>>> --- a/crmd/utils.c
>>>>>>>>>> +++ b/crmd/utils.c
>>>>>>>>>> @@ -1077,3 +1077,13 @@ update_attrd_remote_node_removed(const char *host, const char *user_name)
>>>>>>>>>>      crm_trace("telling attrd to clear attributes for remote host %s", host);
>>>>>>>>>>      update_attrd_helper(host, NULL, NULL, user_name, TRUE, 'C');
>>>>>>>>>>  }
>>>>>>>>>> +
>>>>>>>>>> +void crmd_peer_down(crm_node_t *peer, bool full)
>>>>>>>>>> +{
>>>>>>>>>> +    if(full && peer->state == NULL) {
>>>>>>>>>> +        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>> +        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>> +    }
>>>>>>>>>> +    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>> +    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> On 16 Jan 2014, at 7:24 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>> 16.01.2014, 01:30, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>> On 16 Jan 2014, at 12:41 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>> 15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>>>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>> Ok, here's what happens:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. node2 is lost
>>>>>>>>>>>>>>>>> 2. fencing of node2 starts
>>>>>>>>>>>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>>>>>>>>>>>> 4. node2 returns to the membership
>>>>>>>>>>>>>>>>> 5. node2 is marked as a cluster member
>>>>>>>>>>>>>>>>> 6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>>>>>>>>>>>>    Which is a problem since the node2 fencing operation is part of that
>>>>>>>>>>>>>>>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>>>>>>>>>>>> 8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     Thats in theory, except we automatically try again. Which isn't appropriate.
>>>>>>>>>>>>>>>>>     This should be relatively easy to fix.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     This I have no idea how to fix yet.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>>>>>>>>>>>>> The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>>>>>>>>>>>> So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>>>>>>>>>>>> Damn. Looks like you're right. At some time I broke my agent and had not noticed it. Who will understand.
>>>>>>>>>>>>>>> I repaired my agent - after send reboot he is wait STDIN.
>>>>>>>>>>>>>>> Returned "normally" a behavior - hangs "pending", until manually send reboot. :)
>>>>>>>>>>>>>> Right. Now you're in case 8b.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you try this patch:
>>>>>>>>>>>>>> http://paste.fedoraproject.org/68450/38973966
>>>>>>>>>>>>> Killed all day experiences.
>>>>>>>>>>>>> It turns out here that:
>>>>>>>>>>>>> 1. Did cluster.
>>>>>>>>>>>>> 2. On the node-2 send signal (-4) - killed corosink
>>>>>>>>>>>>> 3.
From node-1 (there DC) - stonith sent reboot >>>>>>>>>>>>> 4. Noda rebooted and resources start. >>>>>>>>>>>>> 5. Again. On the node-2 send signal (-4) - killed corosink >>>>>>>>>>>>> 6. Again. From node-1 (there DC) - stonith sent reboot >>>>>>>>>>>>> 7. Noda-2 rebooted and hangs in "pending" >>>>>>>>>>>>> 8. Waiting, waiting..... manually reboot. >>>>>>>>>>>>> 9. Noda-2 reboot and raised resources start. >>>>>>>>>>>>> 10. GOTO p.2 >>>>>>>>>>>> Logs? >>>>>>>>>>> Yesterday I wrote an additional letter why not put the logs. >>>>>>>>>>> Read it please, it contains a few more questions. >>>>>>>>>>> Today again began to hang and continue along the same cycle. >>>>>>>>>>> Logs here http://send2me.ru/crmrep2.tar.bz2 >>>>>>>>>>>>>>> New logs: http://send2me.ru/crmrep1.tar.bz2 >>>>>>>>>>>>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof >>>>>>>>>>>>>>>>> <and...@beekhof.net> wrote: >>>>>>>>>>>>>>>>>> Apart from anything else, your timeout needs to be >>>>>>>>>>>>>>>>>> bigger: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Jan 13 12:21:36 [17223] >>>>>>>>>>>>>>>>>> dev-cluster2-node1.unix.tensor.ru stonith-ng: ( >>>>>>>>>>>>>>>>>> commands.c:1321 ) error: log_operation: Operation >>>>>>>>>>>>>>>>>> 'reboot' [11331] (call 2 from crmd.17227) for host >>>>>>>>>>>>>>>>>> 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' >>>>>>>>>>>>>>>>>> returned: -62 (Timer expired) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof >>>>>>>>>>>>>>>>>> <and...@beekhof.net> wrote: >>>>>>>>>>>>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev >>>>>>>>>>>>>>>>>>> <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" >>>>>>>>>>>>>>>>>>>> <and...@beekhof.net>: >>>>>>>>>>>>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev >>>>>>>>>>>>>>>>>>>>> <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" >>>>>>>>>>>>>>>>>>>>>> <gre...@yandex.ru>: >>>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" >>>>>>>>>>>>>>>>>>>>>>> <and...@beekhof.net>: >>>>>>>>>>>>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev >>>>>>>>>>>>>>>>>>>>>>>> <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" >>>>>>>>>>>>>>>>>>>>>>>>> <and...@beekhof.net>: >>>>>>>>>>>>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey >>>>>>>>>>>>>>>>>>>>>>>>>> Groshev <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" >>>>>>>>>>>>>>>>>>>>>>>>>>> <and...@beekhof.net>: >>>>>>>>>>>>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey >>>>>>>>>>>>>>>>>>>>>>>>>>>> Groshev <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, ALL. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm still trying to cope with the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> fact that after the fence - node hangs in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> "pending". >>>>>>>>>>>>>>>>>>>>>>>>>>>> Please define "pending". Where did >>>>>>>>>>>>>>>>>>>>>>>>>>>> you see this? >>>>>>>>>>>>>>>>>>>>>>>>>>> In crm_mon: >>>>>>>>>>>>>>>>>>>>>>>>>>> ...... >>>>>>>>>>>>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): >>>>>>>>>>>>>>>>>>>>>>>>>>> pending >>>>>>>>>>>>>>>>>>>>>>>>>>> ...... >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> The experiment was like this: >>>>>>>>>>>>>>>>>>>>>>>>>>> Four nodes in cluster. >>>>>>>>>>>>>>>>>>>>>>>>>>> On one of them kill corosync or >>>>>>>>>>>>>>>>>>>>>>>>>>> pacemakerd (signal 4 or 6 oк 11). 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thereafter, the remaining start it >>>>>>>>>>>>>>>>>>>>>>>>>>> constantly reboot, under various pretexts, "softly >>>>>>>>>>>>>>>>>>>>>>>>>>> whistling", "fly low", "not a cluster member!" ... >>>>>>>>>>>>>>>>>>>>>>>>>>> Then in the log fell out "Too many >>>>>>>>>>>>>>>>>>>>>>>>>>> failures ...." >>>>>>>>>>>>>>>>>>>>>>>>>>> All this time in the status in crm_mon >>>>>>>>>>>>>>>>>>>>>>>>>>> is "pending". >>>>>>>>>>>>>>>>>>>>>>>>>>> Depending on the wind direction >>>>>>>>>>>>>>>>>>>>>>>>>>> changed to "UNCLEAN" >>>>>>>>>>>>>>>>>>>>>>>>>>> Much time has passed and I can not >>>>>>>>>>>>>>>>>>>>>>>>>>> accurately describe the behavior... >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Now I am in the following state: >>>>>>>>>>>>>>>>>>>>>>>>>>> I tried locate the problem. Came here >>>>>>>>>>>>>>>>>>>>>>>>>>> with this. >>>>>>>>>>>>>>>>>>>>>>>>>>> I set big value in property >>>>>>>>>>>>>>>>>>>>>>>>>>> stonith-timeout="600s". >>>>>>>>>>>>>>>>>>>>>>>>>>> And got the following behavior: >>>>>>>>>>>>>>>>>>>>>>>>>>> 1. pkill -4 corosync >>>>>>>>>>>>>>>>>>>>>>>>>>> 2. from node with DC call my fence >>>>>>>>>>>>>>>>>>>>>>>>>>> agent "sshbykey" >>>>>>>>>>>>>>>>>>>>>>>>>>> 3. It sends reboot victim and waits >>>>>>>>>>>>>>>>>>>>>>>>>>> until she comes to life again. >>>>>>>>>>>>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker? >>>>>>>>>>>>>>>>>>>>>>>>>> This sounds like a timing issue that we >>>>>>>>>>>>>>>>>>>>>>>>>> fixed a while back >>>>>>>>>>>>>>>>>>>>>>>>> Was a version 1.1.11 from December 3. >>>>>>>>>>>>>>>>>>>>>>>>> Now try full update and retest. >>>>>>>>>>>>>>>>>>>>>>>> That should be recent enough. Can you >>>>>>>>>>>>>>>>>>>>>>>> create a crm_report the next time you reproduce? >>>>>>>>>>>>>>>>>>>>>>> Of course yes. Little delay.... :) >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> ...... >>>>>>>>>>>>>>>>>>>>>>> cc1: warnings being treated as errors >>>>>>>>>>>>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’: >>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: implicit declaration of >>>>>>>>>>>>>>>>>>>>>>> function ‘g_variant_lookup_value’ >>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: nested extern >>>>>>>>>>>>>>>>>>>>>>> declaration of ‘g_variant_lookup_value’ >>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: assignment makes >>>>>>>>>>>>>>>>>>>>>>> pointer from integer without a cast >>>>>>>>>>>>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] >>>>>>>>>>>>>>>>>>>>>>> Error 1 >>>>>>>>>>>>>>>>>>>>>>> gmake[2]: Leaving directory >>>>>>>>>>>>>>>>>>>>>>> `/root/ha/pacemaker/lib/services' >>>>>>>>>>>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1 >>>>>>>>>>>>>>>>>>>>>>> make[1]: Leaving directory >>>>>>>>>>>>>>>>>>>>>>> `/root/ha/pacemaker/lib' >>>>>>>>>>>>>>>>>>>>>>> make: *** [core] Error 1 >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> I'm trying to solve this a problem. >>>>>>>>>>>>>>>>>>>>>> Do not get solved quickly... >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value >>>>>>>>>>>>>>>>>>>>>> g_variant_lookup_value () Since 2.28 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> # yum list installed glib2 >>>>>>>>>>>>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, >>>>>>>>>>>>>>>>>>>>>> security >>>>>>>>>>>>>>>>>>>>>> This system is receiving updates from RHN >>>>>>>>>>>>>>>>>>>>>> Classic or Red Hat Satellite. 
>>>>>>>>>>>>>>>>>>>>>> Loading mirror speeds from cached hostfile >>>>>>>>>>>>>>>>>>>>>> Installed Packages >>>>>>>>>>>>>>>>>>>>>> glib2.x86_64 >>>>>>>>>>>>>>>>>>>>>> 2.26.1-3.el6 >>>>>>>>>>>>>>>>>>>>>> installed >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> # cat /etc/issue >>>>>>>>>>>>>>>>>>>>>> CentOS release 6.5 (Final) >>>>>>>>>>>>>>>>>>>>>> Kernel \r on an \m >>>>>>>>>>>>>>>>>>>>> Can you try this patch? >>>>>>>>>>>>>>>>>>>>> Upstart jobs wont work, but the code will >>>>>>>>>>>>>>>>>>>>> compile >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> diff --git a/lib/services/upstart.c >>>>>>>>>>>>>>>>>>>>> b/lib/services/upstart.c >>>>>>>>>>>>>>>>>>>>> index 831e7cf..195c3a4 100644 >>>>>>>>>>>>>>>>>>>>> --- a/lib/services/upstart.c >>>>>>>>>>>>>>>>>>>>> +++ b/lib/services/upstart.c >>>>>>>>>>>>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const >>>>>>>>>>>>>>>>>>>>> char *name) >>>>>>>>>>>>>>>>>>>>> static char * >>>>>>>>>>>>>>>>>>>>> upstart_job_property(const char *obj, const >>>>>>>>>>>>>>>>>>>>> gchar * iface, const char *name) >>>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>>> + char *output = NULL; >>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0) >>>>>>>>>>>>>>>>>>>>> + static bool err = TRUE; >>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>> + if(err) { >>>>>>>>>>>>>>>>>>>>> + crm_err("This version of glib is too >>>>>>>>>>>>>>>>>>>>> old to support upstart jobs"); >>>>>>>>>>>>>>>>>>>>> + err = FALSE; >>>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>>> +#else >>>>>>>>>>>>>>>>>>>>> GError *error = NULL; >>>>>>>>>>>>>>>>>>>>> GDBusProxy *proxy; >>>>>>>>>>>>>>>>>>>>> GVariant *asv = NULL; >>>>>>>>>>>>>>>>>>>>> GVariant *value = NULL; >>>>>>>>>>>>>>>>>>>>> GVariant *_ret = NULL; >>>>>>>>>>>>>>>>>>>>> - char *output = NULL; >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> crm_info("Calling GetAll on %s", obj); >>>>>>>>>>>>>>>>>>>>> proxy = get_proxy(obj, BUS_PROPERTY_IFACE); >>>>>>>>>>>>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const >>>>>>>>>>>>>>>>>>>>> char *obj, const gchar * iface, const char *name) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> g_object_unref(proxy); >>>>>>>>>>>>>>>>>>>>> g_variant_unref(_ret); >>>>>>>>>>>>>>>>>>>>> +#endif >>>>>>>>>>>>>>>>>>>>> return output; >>>>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>>> Ok :) I patch source. >>>>>>>>>>>>>>>>>>>> Type "make rc" - the same error. >>>>>>>>>>>>>>>>>>> Because its not building your local changes >>>>>>>>>>>>>>>>>>>> Make new copy via "fetch" - the same error. >>>>>>>>>>>>>>>>>>>> It seems that if not exist >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz, then >>>>>>>>>>>>>>>>>>>> download it. >>>>>>>>>>>>>>>>>>>> Otherwise use exist archive. >>>>>>>>>>>>>>>>>>>> Cutted log ....... >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> # make rc >>>>>>>>>>>>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm >>>>>>>>>>>>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker' >>>>>>>>>>>>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* >>>>>>>>>>>>>>>>>>>> pacemaker-HEAD.tar.* >>>>>>>>>>>>>>>>>>>> if [ ! 
-f >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> rm -f pacemaker.tar.*; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> if [ Pacemaker-1.1.11-rc3 = dirty ]; >>>>>>>>>>>>>>>>>>>> then \ >>>>>>>>>>>>>>>>>>>> git commit -m "DO-NOT-PUSH" -a; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> git archive >>>>>>>>>>>>>>>>>>>> --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD >>>>>>>>>>>>>>>>>>>> | gzip > >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \ >>>>>>>>>>>>>>>>>>>> git reset --mixed HEAD^; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> git archive >>>>>>>>>>>>>>>>>>>> --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ >>>>>>>>>>>>>>>>>>>> Pacemaker-1.1.11-rc3 | gzip > >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \ >>>>>>>>>>>>>>>>>>>> fi; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> echo `date`: Rebuilt >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> echo `date`: Using existing tarball: >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> fi >>>>>>>>>>>>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing >>>>>>>>>>>>>>>>>>>> tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz >>>>>>>>>>>>>>>>>>>> ....... >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Well, "make rpm" - build rpms and I create >>>>>>>>>>>>>>>>>>>> cluster. >>>>>>>>>>>>>>>>>>>> I spent the same tests and confirmed the >>>>>>>>>>>>>>>>>>>> behavior. >>>>>>>>>>>>>>>>>>>> crm_reoprt log here - >>>>>>>>>>>>>>>>>>>> http://send2me.ru/crmrep.tar.bz2 >>>>>>>>>>>>>>>>>>> Thanks! 
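A practical note on the "Using existing tarball" message in the build log above: the quoted Makefile rule only re-runs "git archive" when ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz is absent, and "git archive" exports a commit (the tag, or HEAD via the rule's "dirty" branch), never uncommitted working-tree edits. Both points together explain why a locally applied patch did not end up in the rebuilt RPMs. A minimal sketch of a rebuild that avoids the stale archive, assuming the same checkout and make targets shown in the log (whether TAG=dirty is supported exactly as the quoted rule suggests is an assumption):

# Sketch only, based on the Makefile recipe quoted above.
cd /root/ha/pacemaker
# 1. Drop the cached archive, otherwise the "Using existing tarball" branch
#    is taken and nothing is re-exported.
rm -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
# 2. Confirm the patch is committed; "git archive" never picks up
#    uncommitted working-tree changes.
git status --short
# 3. Rebuild the tarball and the RPMs.
make rc

If the patch must stay uncommitted, the "dirty" branch visible in the quoted rule (temporary commit, archive HEAD, reset) is the path the Makefile itself appears to provide for that case.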
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org