Re: [Pacemaker] Process cib loops infinitely with 100% cpu usage and can't be killed

Andrew Beekhof Mon, 09 Jun 2014 17:13:24 -0700

On 10 Jun 2014, at 10:07 am, Andrew Beekhof <and...@beekhof.net> wrote:


> 
> On 10 Jun 2014, at 9:56 am, Gabriel Gomiz <ggo...@cooperativaobrera.coop> 
> wrote:
> 
>> On 05/30/2014 12:12 AM, Andrew Beekhof wrote:
>>> There have been some big steps forward in cib for the next upstream release 
>>> (it's basically 2 orders of magnitude faster/more efficient).
>>> Current versions will regularly max out a core, albeit for hopefully short 
>>> periods of time depending on the cluster size:
>>> 
>>>     https://twitter.com/beekhof/status/412913549837475840
>>> 
>>> It's also a vicious circle - a busy cib leads to failed resource actions, 
>>> which leads to recovery operations, which leads to more work for the cib.
>>> 
>>> Looking at the size of your cluster, 87 resources on 4 nodes... I can 
>>> imagine that benefitting greatly from the coming version.
>>> 
>>> I notice you're using a rhel package, are you a RH customer or is this on a 
>>> clone?
>> Clone. CentOS.
> 
> Ah ok.  In that case your best bet is to keep using upstream until 1.1.12 
> filters down to RHEL and then CentOS
> 
>>> Also, did anything specific happen prior to the CIB going nuts?
>>>> Only thing that I can think of is a lot of calls to crm_mon via a shell 
>>>> script that we use to check
>>>> which resource groups each node is servicing (attached if you're curious).
>>>> We use this script to apply puppet manifests conditionally to our nodes 
>>>> and do some monitoring. Also
>>>> we have cron jobs checking via the script if the resource group is active 
>>>> before running.
>>>> Maybe the sum of those calls can make the cib process very busy...?
>>> If you were running it every second... maybe. But something is _seriously_ 
>>> wrong if -KILL isn't working!
>>> I wonder how much memory it was using at the time... perhaps the kernel was 
>>> trying to write a huge core file?
>> I don't think so. It was several days in that state.
>> 
>> Is there any way to check if a node has a resource group via a single simple 
>> call to crm resource?
>> Because I didn't find a way, we had to write a script that parses the entire 
>> crm_mon output.

crm_mon can also produce an XML version of its output, which should be easily 
consumable/parseable by a Python or Perl script.
Perhaps that helps.
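
A minimal sketch of that approach in Python, in case it is useful. It assumes 
that crm_mon in 1.1.x accepts "-1 --as-xml" for one-shot XML output, and that 
the XML nests <group id=...> elements containing <resource active=...> 
elements, each with <node name=...> children for the nodes it runs on. Those 
element and attribute names are assumptions; check them against the real 
output of your version. The function names here are made up for illustration.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def groups_on_node(xml_text, node_name):
    """Return sorted ids of resource groups with an active member on node_name.

    Assumed layout (verify against your crm_mon output):
      <group id="..."><resource active="true"><node name="..."/></resource></group>
    """
    root = ET.fromstring(xml_text)
    found = set()
    for group in root.iter("group"):
        group_id = group.get("id")
        if group_id is None:
            continue
        for resource in group.iter("resource"):
            # Skip members that are stopped or otherwise inactive.
            if resource.get("active") != "true":
                continue
            if any(n.get("name") == node_name for n in resource.iter("node")):
                found.add(group_id)
                break
    return sorted(found)


def fetch_status_xml():
    """Run crm_mon once and return its XML output (flags assumed for 1.1.x)."""
    return subprocess.check_output(["crm_mon", "-1", "--as-xml"]).decode()


if __name__ == "__main__":
    # Usage: script.py NODE_NAME  (the node name as the cluster knows it)
    print("\n".join(groups_on_node(fetch_status_xml(), sys.argv[1])))
```

This is still one crm_mon call per check, so it does not reduce the load on 
the cib by itself; it only replaces the text parsing. Caching the XML for a 
few seconds and letting puppet and the cron jobs share it would cut the 
number of calls.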

>>> 
>>>> Anyway, I've built 1.1.12 rc1 RPMS and this morning I've upgraded the 
>>>> cluster. Will let you know if
>>>> there is something weird after this upgrade.
>>> Ok, I'd be interested to hear your feedback.
>> 
>> 1.1.12 rc1 has been working flawlessly so far. So it looks like it's fixed 
>> in that version.
>> 
>> Thanks!
> 
> Glad to hear we could give you a solution :)


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
