Re: Server increasing load due increasing processes in D state

2013-02-25 Thread Eduardo Damato

Hi Alessandro,

Thanks for the information.

The sysrq-t output that I requested is *only* useful while the problem is
occurring. Please capture it the next time you hit the problem.

You may be overcommitting CPUs on this system by running many
virtual machines on the nova controller node. This is a
completely wild guess, but I would recommend checking how many
CPUs you have, how many virtual machines are running, and whether any
processes use a real-time scheduling policy such as SCHED_FIFO.
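
A rough way to check this from a shell is sketched below. It assumes
procps (ps, pgrep) and coreutils (nproc) are installed and that each guest
runs as a process named "kvm" or "qemu-system-*"; the function names
rt_tasks and cpu_report are made up for this example.

```shell
# Filter `ps -eo pid,cls,rtprio,comm` output (read from stdin) down to the
# header line plus tasks in a real-time class: FF = SCHED_FIFO, RR = SCHED_RR.
rt_tasks() {
    awk 'NR == 1 || $2 == "FF" || $2 == "RR"'
}

# Print the CPU count, a rough guest count, and the real-time tasks.
# Assumption: guests show up as processes named "kvm" or "qemu-system-*".
cpu_report() {
    echo "online CPUs: $(nproc)"
    echo "guest processes: $(pgrep -c -x 'kvm|qemu-system-.*' || true)"
    ps -eo pid,cls,rtprio,comm | rt_tasks
}
```

Running cpu_report on the affected node prints the three figures. Kernel
threads such as migration/N normally run as SCHED_FIFO, so the entries to
look for in that list are userspace processes.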

Cheers,
Eduardo.

-- 
ubuntu-server mailing list
ubuntu-server@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server
More info: https://wiki.ubuntu.com/ServerTeam


Re: Server increasing load due increasing processes in D state

2013-02-25 Thread Eduardo Damato
Hi Alessandro,

Which node are you having problems with? Is this a compute node? Can
you give more information on the layout of your nova installation? I can
see that qemu and RabbitMQ are running on the same node. Do you use the
compute node as an MQ node as well?

The problem here seems to be kernel-related, since a large number of
tasks are stuck in the same WCHAN (the kernel function they are sleeping in).
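
To see that distribution at a glance, something like the following could
be used. It is only a sketch: it assumes the procps version of ps, and the
function names d_state_by_wchan and d_state_report are made up for this
example.

```shell
# Read "STATE WCHAN PID COMMAND" lines on stdin and print how many
# D-state (uninterruptible sleep) tasks are waiting in each kernel function,
# most common first.
d_state_by_wchan() {
    awk '$1 == "D" { n[$2]++ } END { for (w in n) print n[w], w }' | sort -rn
}

# Feed it from ps; wchan:32 widens the column so that function names are
# less likely to be truncated.
d_state_report() {
    ps -eo state,wchan:32,pid,comm --no-headers | d_state_by_wchan
}
```

Calling d_state_report prints one "count function" line per wait channel.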

Ideally, it would be good to have the output of sysrq-t from this system,
but this can cause the system to hang or crash depending on its
state, especially because we already know that there are many
task_structs blocked in the same place.

You could do:

# echo t > /proc/sysrq-trigger
(wait 5 s)
# echo t > /proc/sysrq-trigger
(wait 5 s)
# echo t > /proc/sysrq-trigger

And then we can have a look at the traces and see if they're moving or not.
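
If it helps, the same sequence can be wrapped in a small function that also
writes a marker line to the kernel log before each dump, which makes the
three dumps easier to tell apart afterwards. This is a sketch only: the
function name and the SYSRQ_TRIGGER and KMSG variables are made up for this
example, and it has to run as root.

```shell
# Send `count` sysrq-t requests, `delay` seconds apart, writing a marker
# line to the kernel log before each one so the dumps can be told apart.
# The two path variables exist only so the function can be dry-run against
# ordinary files instead of the real kernel interfaces.
sysrq_t_series() {
    count=${1:-3}
    delay=${2:-5}
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    kmsg=${KMSG:-/dev/kmsg}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "sysrq-t dump $i of $count" > "$kmsg"
        echo t > "$trigger"
        echo "sent dump $i"
        # Sleep only between dumps, not after the last one.
        if [ "$i" -lt "$count" ]; then sleep "$delay"; fi
        i=$((i + 1))
    done
}
```

After sysrq_t_series returns, the traces should be in the kernel log
(dmesg, or /var/log/kern.log on a default Ubuntu install). With this many
tasks the dmesg ring buffer may wrap, so the log file is probably the safer
source.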

lsof is blocked reading the memory maps of process 1227. This could lead
to more information on the problem, but at the same time because there
are so many blocked processes it could be just another sign of the
problem and not a hint to the reason why this is happening.
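
For a single task such as 1227, the kernel-side view can sometimes be read
straight from /proc. The sketch below assumes the running kernel exposes
/proc/<pid>/stack (root only, and not built into every kernel) and that
reading these two files does not block the way reading maps does; the
function name and the PROC_ROOT variable are made up for this example.

```shell
# Print where a single task is blocked in the kernel.
# /proc/<pid>/wchan names the function the task is sleeping in;
# /proc/<pid>/stack shows the full kernel call chain when available.
blocked_where() {
    pid=$1
    # PROC_ROOT can point at a copy of the files for offline inspection.
    proc=${PROC_ROOT:-/proc}
    echo "wchan: $(cat "$proc/$pid/wchan")"
    cat "$proc/$pid/stack"
}
```

For example, "blocked_where 1227" run as root would print the wait channel
followed by the stack of the nova-dhcpbridge process.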

Without kernel traces (sysrq-t) or a vmcore it will be hard to
understand what's happening. It doesn't seem to be I/O related.

Cheers,
Eduardo.

On 25/02/13 12:10, Alessandro Tagliapietra wrote:
> After an strace of lsof I've seen it hangs on
>
> stat("/proc/1227/", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
> open("/proc/1227/stat", O_RDONLY)   = 4
> read(4, "1227 (nova-dhcpbridge) D 1224 25"..., 4096) = 242
> close(4)= 0
> readlink("/proc/1227/cwd", "/"..., 4096) = 1
> stat("/proc/1227/cwd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> readlink("/proc/1227/root", "/", 4096)  = 1
> stat("/proc/1227/root", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> readlink("/proc/1227/exe", "/usr/bin/python2.7"..., 4096) = 18
> stat("/proc/1227/exe", {st_mode=S_IFREG|0755, st_size=2989480, ...}) = 0
> open("/proc/1227/maps", O_RDONLY)   = 4
> read(4,
>
> Could it be a memory issue?
> I can't run a memory test right now, maybe tomorrow. I just wanted to
> know whether anyone else has had the same issue.
> Thanks in advance
> --
>
> Alessandro Tagliapietra
> alexfu.it 
>
> On Monday, 25 February 2013, at 12:29, Alessandro
> Tagliapietra wrote:
>
>> Hello guys,
>>
>> at work we have an OpenStack controller whose load, for some months
>> now, has started climbing after a few days of uptime.
>>
>> I've seen that the cause is that processes sometimes hang and remain
>> in D state.
>>
>> I've used some combination of ps args to get these outputs:
>>
>> http://pastebin.com/raw.php?i=LGGzGrWu
>> http://pastie.org/pastes/6332964/text
>> http://pastie.org/pastes/6332979/text
>>
>> The storage is a software RAID1 over 2 disks, whose SMART values are fine.
>>
>> Commands such as lsof or strace on a D-state process never return.
>>
>> Any idea on what could be the cause?
>>
>> Thanks in advance
>>
>> --
>>
>> Alessandro Tagliapietra
>> alexfu.it 
>
>
>
