On 27 May 2014, at 5:48 am, Gabriel Gomiz <ggo...@cooperativaobrera.coop> wrote:

> Hello Andrew and cluster folks!
> 
> Over the last month we have been experiencing a weird problem with the cib
> process on one of our nodes ('gandalf'); it's a 4-node cluster. Brief
> description:
> 
> For some undetermined reason (we still can't figure out why) it begins
> looping infinitely and consuming 100% CPU.

Apart from the CPU usage, is there something in particular that makes you think 
it's looping?
There have been some big steps forward in cib for the next upstream release 
(it's basically 2 orders of magnitude faster/more efficient).
Current versions will regularly max out a core, albeit for hopefully short 
periods of time depending on the cluster size:

        https://twitter.com/beekhof/status/412913549837475840
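
One way to tell a sustained loop from the short bursts described above is to
sample the process's CPU time at intervals. The following is only a minimal
sketch, assuming a Linux /proc filesystem; the function names cpu_ticks and
sample_cpu are hypothetical helpers made up for illustration and are not part
of pacemaker.

```shell
#!/bin/sh
# Print the cumulative CPU time (user + system, in clock ticks) of a PID.
cpu_ticks() {
    [ -r "/proc/$1/stat" ] || return 1
    # Strip everything up to the last ") " so a command name containing
    # spaces cannot shift the fields. After stripping, utime and stime
    # (fields 14 and 15 of the full line) are fields 12 and 13.
    sed 's/.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# Sample a PID COUNT times, INTERVAL whole seconds apart, and print the
# share of one core used during each interval.
sample_cpu() {
    pid=$1
    interval=${2:-5}
    count=${3:-6}
    hz=$(getconf CLK_TCK)
    prev=$(cpu_ticks "$pid") || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        sleep "$interval"
        cur=$(cpu_ticks "$pid") || return 1
        pct=$(( (cur - prev) * 100 / (hz * interval) ))
        echo "$(date '+%H:%M:%S') pid=$pid cpu=${pct}%"
        prev=$cur
        i=$((i + 1))
    done
}

# Example (PID taken from the ps output quoted below):
#   sample_cpu 25972 5 12
```

If every sample stays near 100% over several minutes, the process is busy
continuously rather than in short bursts. That alone does not prove it is
looping, but it narrows things down.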

It's also a vicious circle: a busy cib leads to failed resource actions, which 
lead to recovery operations, which lead to more work for the cib.

Looking at the size of your cluster, 87 resources on 4 nodes... I can imagine 
that benefitting greatly from the coming version.

I notice you're using a RHEL package. Are you a RH customer, or is this on a 
clone?
Also, did anything specific happen prior to the CIB going nuts?


> After that, the crm_mon command can't connect to pacemaker and returns this output:
> 
> Could not establish cib_ro connection: Resource temporarily unavailable (11)
> 
> Connection to cluster failed: Transport endpoint is not connected
> 
> The Pacemaker processes look like this:
> 
> [DB1] gandalf # psx | grep pacemaker
> root     25966  0.0  0.0  80480  2840 ?        S    May23   0:15 pacemakerd
> 496      25972 83.8  0.0 111632 27888 ?        Rs   May23 4045:07  \_ /usr/libexec/pacemaker/cib
> root     25973  0.0  0.0 101716 12424 ?        Ss   May23   0:19  \_ /usr/libexec/pacemaker/stonithd
> root     25974  0.0  0.0  76644  3552 ?        Ss   May23   0:30  \_ /usr/libexec/pacemaker/lrmd
> 496      25975  0.0  0.0  89624  3368 ?        Ss   May23   0:15  \_ /usr/libexec/pacemaker/attrd
> 496      25976  0.0  0.0  81172  2568 ?        Ss   May23   0:14  \_ /usr/libexec/pacemaker/pengine
> root     25977  0.0  0.0 107700  7116 ?        Ss   May23   0:17  \_ /usr/libexec/pacemaker/crmd
> 
> The cluster is still operating normally with all resources running, and this
> node is reported alive by the other 3 members:
> 
> [VM2] lorien # crm_mon -1
> Last updated: Mon May 26 16:38:18 2014
> Last change: Fri May 23 08:02:17 2014 via cibadmin on lorien.san01.cooperativaobrera.coop
> Stack: cman
> Current DC: lorien.san01.cooperativaobrera.coop - partition with quorum
> Version: 1.1.10-14.el6_5.2-368c726
> 4 Nodes configured
> 87 Resources configured
> 
> Online: [ gandalf.san01.cooperativaobrera.coop isildur.san01.cooperativaobrera.coop
>           lorien.san01.cooperativaobrera.coop mordor.san01.cooperativaobrera.coop ]
> 
> At the moment the node is still in that state. I don't want to move resources
> because I don't know how the cluster will react. If you want me to run some
> tests or collect logs, I'll leave the node in that state so you can test
> whatever you want.
> 
> Logs stopped just before the cib process started looping. The last messages are:
> 
> May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_compress_string:       Compressed 258760 bytes into 14072 (ratio 18:1) in 67ms
> May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_client_destroy:        Destroying 0 events
> May 23 21:03:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_client_new:    Connecting 0x2ddec30 for uid=0 gid=0 pid=17759 id=2ce51a5a-a70e-4b24-8726-b97b6f9013fd
> 
> Finally, the cib process in this state can't be killed, not even with "-9".
> We have to reboot the node to clean up pacemaker and start again.
> 
> The system is CentOS 6, with official packages. We had version
> pacemaker-1.1.10-14.el6_5.2.x86_64. After the last reboot we upgraded to
> pacemaker-1.1.10-14.el6_5.3.x86_64 and the problem still exists.
> 
> Have any of you experienced something like this?
> 
> Thanks in advance for any help!
> 
> Cheers
> 
> -- 
> Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
> ggo...@cooperativaobrera.coop
> Gerencia de Sistemas - Cooperativa Obrera Ltda.
> Tel: (0291) 403-9700
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
