On 10 Jun 2014, at 10:07 am, Andrew Beekhof <and...@beekhof.net> wrote:
>
> On 10 Jun 2014, at 9:56 am, Gabriel Gomiz <ggo...@cooperativaobrera.coop> wrote:
>
>> On 05/30/2014 12:12 AM, Andrew Beekhof wrote:
>>> There have been some big steps forward in cib for the next upstream release
>>> (it's basically 2 orders of magnitude faster/more efficient).
>>> Current versions will regularly max out a core, albeit for hopefully short
>>> periods of time depending on the cluster size:
>>>
>>> https://twitter.com/beekhof/status/412913549837475840
>>>
>>> It's also a vicious circle: a busy cib leads to failed resource actions,
>>> which leads to recovery operations, which leads to more work for the cib.
>>>
>>> Looking at the size of your cluster, 87 resources on 4 nodes... I can
>>> imagine that benefitting greatly from the coming version.
>>>
>>> I notice you're using a RHEL package. Are you a RH customer or is this on a
>>> clone?
>> Clone. CentOS.
>
> Ah ok. In that case your best bet is to keep using upstream until 1.1.12
> filters down to RHEL and then CentOS.
>
>>> Also, did anything specific happen prior to the CIB going nuts?
>>>> The only thing I can think of is a lot of calls to crm_mon via a shell
>>>> script that we use to check which resource groups each node is servicing
>>>> (attached if you're curious). We use this script to apply puppet manifests
>>>> conditionally to our nodes and to do some monitoring. We also have cron
>>>> jobs that check via the script whether the resource group is active
>>>> before running.
>>>> Maybe the sum of those calls can make the cib process very busy...?
>>> If you were running it every second... maybe. But something is _seriously_
>>> wrong if -KILL isn't working!
>>> I wonder how much memory it was using at the time... perhaps the kernel was
>>> trying to write a huge core file?
>> I don't think so. It was in that state for several days.
>>
>> Is there any way to check if a node has a resource group via a single simple
>> call to crm resource? Because I didn't find one, we had to write a script
>> that parses the entire crm_mon output.

crm_mon can also produce an XML version of its output, which should be easily
consumable/parseable by a Python or Perl script. Perhaps that helps.

>>>
>>>> Anyway, I've built 1.1.12 rc1 RPMs and this morning I've upgraded the
>>>> cluster. Will let you know if there is something weird after this upgrade.
>>> Ok, I'd be interested to hear your feedback.
>>
>> 1.1.12 rc1 has been working flawlessly so far, so it looks like it's fixed
>> in that version.
>>
>> Thanks!
>
> Glad to hear we could give you a solution :)
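A minimal Python sketch of the crm_mon XML suggestion, offered as an illustration rather than a tested tool. It assumes the Pacemaker 1.1.x spelling `crm_mon --as-xml` (newer releases may use `--output-as=xml` instead) and an element layout where each `<group id=...>` contains `<resource active=...>` elements with `<node name=...>` children; check both against the actual output of your version before relying on it. The helper names `fetch_status_xml`, `groups_by_node` and `node_has_group` are hypothetical, and "a node has a group" is taken here to mean that at least one active member of the group is reported on that node.

```python
"""Sketch: map Pacemaker resource groups to the nodes running them,
using the XML status that crm_mon can emit.

ASSUMPTION: the <resources><group id=...><resource active=...>
<node name=.../> layout matches your crm_mon version's XML output.
"""
import subprocess
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, Set


def fetch_status_xml() -> str:
    """Run crm_mon once and return its XML status (needs cluster access)."""
    # ASSUMPTION: --as-xml is the 1.1.x flag; newer releases may need
    # --output-as=xml instead.
    result = subprocess.run(
        ["crm_mon", "--as-xml"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def groups_by_node(xml_text: str) -> Dict[str, Set[str]]:
    """Return {node_name: {group_id, ...}} for groups with an active member there."""
    root = ET.fromstring(xml_text)
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for group in root.iter("group"):
        group_id = group.get("id")
        for resource in group.iter("resource"):
            # Skip stopped/inactive members; they report no hosting node.
            if resource.get("active") != "true":
                continue
            # Each <node> child names a node the resource is running on.
            for node in resource.findall("node"):
                mapping[node.get("name")].add(group_id)
    return dict(mapping)


def node_has_group(xml_text: str, node: str, group: str) -> bool:
    """True if any active member of `group` is reported on `node`."""
    return group in groups_by_node(xml_text).get(node, set())
```

Usage might look like `node_has_group(fetch_status_xml(), "node1", "grp_web")`, where the node and group names are placeholders. A cron job could fetch the XML once and pass the same string to several checks, rather than invoking crm_mon once per group.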
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org