05.07.2011 19:10, Steven Dake wrote:
> On 07/05/2011 07:26 AM, Vladislav Bogdanov wrote:
>> Hi all,
>>
>> For the last few days I have been seeing the following message in the logs:
>> [TOTEM ] Process pause detected for XXX ms, flushing membership messages.
>>
>> After that the ring is quickly re-established.
>> DLM/clvmd notices this and switches to kern_stop, waiting for fencing to
>> be done. However, the output of dlm_tool ls is really strange: flags and
>> members differ between nodes. I have dumps of what has been happening in
>> dlm, and there are messages saying that fencing was done!
>>
>> On the other hand, pacemaker does not notice anything, so fencing is not
>> done. This is rather strange, but it is a topic for another list.
>>
>> Can anybody please explain what exactly that message means and what the
>> correct reaction of upper-layer services should be?
>> Can it be caused solely by network problems?
>> Can the number of buffers in the RX ring of the ethernet card influence
>> this (I did some tuning there some time ago)?
>>
>> corosync 1.3.1, UDPU transport.
>> pacemaker-1.1-devel
>> dlm_controld.pcmk from 3.0.17
>> clvmd 2.02.85
>> clusterlib-3.1.1
>>
> 
> This indicates that the kernel has paused scheduling of corosync, or that
> corosync has blocked, for the time value printed in the message.

I suspected this, thanks for the clarification.
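
To make sure I understand the idea, here is a minimal sketch of how such a
pause check can work in principle. It is only an illustration under my own
assumptions, not corosync's actual code: the class name, the threshold
parameter and the injectable clock are all hypothetical. The process records
a monotonic timestamp on every timer tick, and if the gap between two ticks
is much larger than expected, it concludes that it was not scheduled (or was
blocked) for that long.

```python
import time


class PauseDetector:
    """Illustrative only: detects long gaps between periodic ticks."""

    def __init__(self, threshold_ms, clock=time.monotonic):
        # threshold_ms: gap above which a pause is reported (hypothetical knob)
        # clock: injectable so the logic can be tested without sleeping
        self.threshold_ms = threshold_ms
        self._clock = clock
        self._last = clock()

    def tick(self):
        """Call from a periodic timer.

        Returns the gap in milliseconds if it exceeds the threshold,
        otherwise None.
        """
        now = self._clock()
        gap_ms = (now - self._last) * 1000.0
        self._last = now
        if gap_ms > self.threshold_ms:
            return gap_ms
        return None
```

A monotonic clock is used so that wall-clock adjustments are not mistaken
for a pause.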

> Corosync is non-blocking.
> 
> Are you running inside a VM?  Increasing token is probably a necessity
> when running inside a VM on a heavily loaded host because kvm does not
> schedule as fairly as bare metal.
> 
> Please provide feedback on whether this is bare metal or a VM.

I see this both on one node running in a VM and on bare metal hosts under
high load (30 VMs are being installed on each 12-core node, so CPU usage is
quite high).

I removed the eth RX ring buffer tuning from the physical hosts (it is now
the default 256 instead of the maximum 4096).
I will see what happens.

This could be a problem with the ethernet driver on the bare metal nodes as well.

For the VM I'll try to increase its weight via cgroups.

Steve, can you please also explain why I'm unable to move corosync to
another (non-default) CPU cgroup? Is this caused by a real-time
priority? I just wanted to increase its weight.
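
To check the real-time part on my side, something like the sketch below
should show which scheduling policy a process runs under. It assumes Linux
and Python 3.3 or later; the helper names are mine, and the pid passed to
describe() would be that of the running corosync process.

```python
import os

# Linux may OR this flag into the value returned by sched_getscheduler().
_RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0)

_POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
}


def policy_name(policy):
    """Map a scheduling policy constant to a readable name."""
    policy &= ~_RESET_ON_FORK
    return _POLICY_NAMES.get(policy, "unknown (%d)" % policy)


def is_realtime(policy):
    """True for the two real-time policies, SCHED_FIFO and SCHED_RR."""
    return (policy & ~_RESET_ON_FORK) in (os.SCHED_FIFO, os.SCHED_RR)


def describe(pid):
    """Return (policy name, is real-time, priority) for a pid (0 = self)."""
    policy = os.sched_getscheduler(pid)
    priority = os.sched_getparam(pid).sched_priority
    return policy_name(policy), is_realtime(policy), priority
```

For example, print(describe(pid)) with pid set to the corosync pid.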

Best,
Vladislav
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
