Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-19 Thread Gleb Natapov
On Wed, Apr 18, 2012 at 09:44:47PM -0700, Chegu Vinod wrote:
 On 4/17/2012 6:25 AM, Chegu Vinod wrote:
 On 4/17/2012 2:49 AM, Gleb Natapov wrote:
 On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
 On 4/16/2012 5:18 AM, Gleb Natapov wrote:
 On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
 On 04/11/2012 01:21 PM, Chegu Vinod wrote:
 Hello,
 
 While running AIM7 (workfile.high_systime) in a single 40-way (or a single
 60-way) KVM guest, I noticed pretty bad performance when the guest was
 booted with the 3.3.1 kernel compared to the same guest booted with the
 2.6.32-220 (RHEL6.2) kernel.
 For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x
 better than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run
 the older guest kernel was nearly 12x better!
 How many CPUs does your host have?
 80 cores on the DL980 (i.e. 8 Westmere sockets).
 
 So you are not oversubscribing CPUs at all. Are those real cores,
 or does that count include HT?
 
 HT is off.
 
 Do you have other CPU hogs running on the host while testing the guest?
 
 Nope.  Sometimes I do run utilities like perf, sar, or
 mpstat on NUMA node 0 (where
 the guest is not running).
 
 
 I was using numactl to bind the qemu of the 40-way guest to NUMA
 nodes 4-7 (or, for a 60-way guest,
 binding it to nodes 2-7).
 
 /etc/qemu-ifup tap0
 
 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
 /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu 
 Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
 -enable-kvm \
 -m 65536 -smp 40 \
 -name vm1 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
 \
 -drive 
 file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
 -monitor stdio \
 -net nic,macaddr=..mac_addr..  \
 -net tap,ifname=tap0,script=no,downscript=no \
 -vnc :4
 
 /etc/qemu-ifdown tap0
 
 
 I knew that there would be a few additional temporary qemu worker
 threads created, i.e. there would be some
 oversubscription.
 
 The 4 nodes above have 40 real cores, yes?
 
 Yes.
 Other than qemu's related threads and some of the generic
 per-CPU Linux kernel threads (e.g. migration, etc.),
 there isn't anything else running on these NUMA nodes.
 
 Can you try to run the upstream
 kernel without binding at all and check the performance?
 
 
 Re-ran the same workload *without* binding the qemu, but using the
 3.3.1 kernel:
 
 20-way guest: performance got much worse when compared to the case
 where we bind the qemu.
 40-way guest: about the same as in the case where we bind the qemu.
 60-way guest: about the same as in the case where we bind the qemu.
 
 Trying out a couple of other experiments...
 
With 8 sockets the NUMA effects are probably very strong. A couple of things to
try:
1. Run a VM that fits into one NUMA node and bind it to that node. Compare the
   performance of the RHEL kernel and the upstream kernel.
2. Run a VM bigger than a NUMA node, bind the vcpus to NUMA nodes separately,
   and pass the resulting topology to the guest using the -numa flag (see the
   sketch below).
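
A rough sketch of option 2 for the 40-way case. The two-node guest split, the
host cpu numbers, and the vcpu-thread pinning step below are illustrative
assumptions (and the exact -numa option spelling is worth double-checking
against the qemu version used), not details taken from the runs above:

  numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 \
  /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu Westmere -m 65536 -smp 40 \
  -numa node,nodeid=0,cpus=0-19,mem=32768 \
  -numa node,nodeid=1,cpus=20-39,mem=32768 \
  -drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
  -monitor stdio -vnc :4

  # Then pin each vcpu thread to a host node, using the thread ids reported
  # by "info cpus" in the qemu monitor, e.g. (example cpu range and tid):
  taskset -pc 40-49 <vcpu0-thread-id>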

--
Gleb.


Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-18 Thread Chegu Vinod

On 4/17/2012 6:25 AM, Chegu Vinod wrote:

On 4/17/2012 2:49 AM, Gleb Natapov wrote:

On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:

On 4/16/2012 5:18 AM, Gleb Natapov wrote:

On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:

On 04/11/2012 01:21 PM, Chegu Vinod wrote:

Hello,

While running AIM7 (workfile.high_systime) in a single 40-way (or a single
60-way) KVM guest, I noticed pretty bad performance when the guest was booted
with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220
(RHEL6.2) kernel.

For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x
better than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run
the older guest kernel was nearly 12x better!

How many CPUs does your host have?

80 cores on the DL980 (i.e. 8 Westmere sockets).

So you are not oversubscribing CPUs at all. Are those real cores, or does that
count include HT?


HT is off.


Do you have other CPU hogs running on the host while testing the guest?


Nope.  Sometimes I do run utilities like perf, sar, or
mpstat on NUMA node 0 (where
the guest is not running).




I was using numactl to bind the qemu of the 40-way guest to NUMA
nodes 4-7 (or, for a 60-way guest,
binding it to nodes 2-7).

/etc/qemu-ifup tap0

numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
/usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu 
Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme

-enable-kvm \
-m 65536 -smp 40 \
-name vm1 -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait

\
-drive 
file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none

-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,macaddr=..mac_addr..  \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

/etc/qemu-ifdown tap0


I knew that there would be a few additional temporary qemu worker
threads created, i.e. there would be some
oversubscription.


The 4 nodes above have 40 real cores, yes?


Yes.
Other than qemu's related threads and some of the generic per-CPU
Linux kernel threads (e.g. migration, etc.),
there isn't anything else running on these NUMA nodes.


Can you try to run the upstream
kernel without binding at all and check the performance?




Re-ran the same workload *without* binding the qemu, but using the
3.3.1 kernel:

20-way guest: performance got much worse when compared to the case where
we bind the qemu.
40-way guest: about the same as in the case where we bind the qemu.
60-way guest: about the same as in the case where we bind the qemu.

Trying out a couple of other experiments...

FYI
Vinod





Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-17 Thread Gleb Natapov
On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
 On 4/16/2012 5:18 AM, Gleb Natapov wrote:
 On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
 On 04/11/2012 01:21 PM, Chegu Vinod wrote:
 Hello,
 
 While running AIM7 (workfile.high_systime) in a single 40-way (or a
 single 60-way) KVM guest, I noticed pretty bad performance when the guest
 was booted with the 3.3.1 kernel compared to the same guest booted with
 the 2.6.32-220 (RHEL6.2) kernel.
 For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x
 better than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run
 the older guest kernel was nearly 12x better!
 How many CPUs does your host have?
 
 80 cores on the DL980 (i.e. 8 Westmere sockets).
 
So you are not oversubscribing CPUs at all. Are those real cores, or does that
count include HT?
Do you have other CPU hogs running on the host while testing the guest?

 I was using numactl to bind the qemu of the 40-way guest to NUMA
 nodes 4-7 (or, for a 60-way guest,
 binding it to nodes 2-7).
 
 /etc/qemu-ifup tap0
 
 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
 /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu 
 Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
 -enable-kvm \
 -m 65536 -smp 40 \
 -name vm1 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
 \
 -drive 
 file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
 -monitor stdio \
 -net nic,macaddr=..mac_addr.. \
 -net tap,ifname=tap0,script=no,downscript=no \
 -vnc :4
 
 /etc/qemu-ifdown tap0
 
 
 I knew that there would be a few additional temporary qemu worker
 threads created, i.e. there would be some
 oversubscription.
 
The 4 nodes above have 40 real cores, yes? Can you try to run the upstream
kernel without binding at all and check the performance?

 
 Will have to retry by doing some explicit pinning of the vcpus to
 native cores (without using virsh).
 
 Turned on function tracing and found that there appears to be more time 
 being
 spent around the lock code in the 3.3.1 guest when compared to the 
 2.6.32-220
 guest.
 Looks like you may be running into the ticket spinlock
 code. During the early RHEL 6 days, Gleb came up with a
 patch to automatically disable ticket spinlocks when
 running inside a KVM guest.
 
 IIRC that patch got rejected upstream at the time,
 with upstream developers preferring to wait for a
 better solution.
 
 If such a better solution is not on its way upstream
 now (two years later), maybe we should just merge
 Gleb's patch upstream for the time being?
 I think the pv spinlock that is actively being discussed currently should
 address the issue, but I am not sure anyone has tested it against a
 non-ticket lock in a guest to see which one performs better.
 
 I did see that discussion...seems to have originated from the Xen context.
 
Yes, the problem is the same for both hypervisors.

--
Gleb.


Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-17 Thread Chegu Vinod

On 4/17/2012 2:49 AM, Gleb Natapov wrote:

On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:

On 4/16/2012 5:18 AM, Gleb Natapov wrote:

On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:

On 04/11/2012 01:21 PM, Chegu Vinod wrote:

Hello,

While running AIM7 (workfile.high_systime) in a single 40-way (or a single
60-way) KVM guest, I noticed pretty bad performance when the guest was booted
with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220
(RHEL6.2) kernel.
For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run the older guest
kernel was nearly 12x better!

How many CPUs does your host have?

80 cores on the DL980 (i.e. 8 Westmere sockets).


So you are not oversubscribing CPUs at all. Are those real cores, or does that
count include HT?


HT is off.


Do you have other CPU hogs running on the host while testing the guest?


Nope.  Sometimes I do run utilities like perf, sar, or mpstat
on NUMA node 0 (where
the guest is not running).




I was using numactl to bind the qemu of the 40-way guest to NUMA
nodes 4-7 (or, for a 60-way guest,
binding it to nodes 2-7).

/etc/qemu-ifup tap0

numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
/usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu 
Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-enable-kvm \
-m 65536 -smp 40 \
-name vm1 -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
\
-drive 
file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,macaddr=..mac_addr..  \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

/etc/qemu-ifdown tap0


I knew that there would be a few additional temporary qemu worker
threads created, i.e. there would be some
oversubscription.


The 4 nodes above have 40 real cores, yes?


Yes.
Other than qemu's related threads and some of the generic per-CPU
Linux kernel threads (e.g. migration, etc.),
there isn't anything else running on these NUMA nodes.


Can you try to run the upstream
kernel without binding at all and check the performance?



I shall re-run and get back to you with this info.

Typically for the native runs, binding the workload results in better
numbers. Hence I chose to do the binding for the guest too, i.e. on the
same NUMA nodes as the native case, for virt. vs. native comparison
purposes. Having said that, in the past I had seen a couple of cases
where the non-bound guest performed better than the native case. Need to
re-run and dig into this further...





Will have to retry by doing some explicit pinning of the vcpus to
native cores (without using virsh).


Turned on function tracing and found that there appears to be more time being
spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
guest.

Looks like you may be running into the ticket spinlock
code. During the early RHEL 6 days, Gleb came up with a
patch to automatically disable ticket spinlocks when
running inside a KVM guest.

IIRC that patch got rejected upstream at the time,
with upstream developers preferring to wait for a
better solution.

If such a better solution is not on its way upstream
now (two years later), maybe we should just merge
Gleb's patch upstream for the time being?

I think the pv spinlock that is actively being discussed currently should
address the issue, but I am not sure anyone has tested it against a non-ticket
lock in a guest to see which one performs better.

I did see that discussion...seems to have originated from the Xen context.


Yes, the problem is the same for both hypervisors.

--
Gleb.


Thanks
Vinod



Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-16 Thread Gleb Natapov
On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
 On 04/11/2012 01:21 PM, Chegu Vinod wrote:
 
 Hello,
 
 While running AIM7 (workfile.high_systime) in a single 40-way (or a single
 60-way) KVM guest, I noticed pretty bad performance when the guest was booted
 with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220
 (RHEL6.2) kernel.
 
 For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x
 better than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run
 the older guest kernel was nearly 12x better!
 
How many CPUs does your host have?

 Turned on function tracing and found that there appears to be more time being
 spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
 guest.
 
 Looks like you may be running into the ticket spinlock
 code. During the early RHEL 6 days, Gleb came up with a
 patch to automatically disable ticket spinlocks when
 running inside a KVM guest.
 
 IIRC that patch got rejected upstream at the time,
 with upstream developers preferring to wait for a
 better solution.
 
 If such a better solution is not on its way upstream
 now (two years later), maybe we should just merge
 Gleb's patch upstream for the time being?
I think the pv spinlock that is actively being discussed currently should
address the issue, but I am not sure anyone has tested it against a non-ticket
lock in a guest to see which one performs better.

--
Gleb.


Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-16 Thread Chegu Vinod

On 4/16/2012 5:18 AM, Gleb Natapov wrote:

On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:

On 04/11/2012 01:21 PM, Chegu Vinod wrote:

Hello,

While running AIM7 (workfile.high_systime) in a single 40-way (or a single
60-way) KVM guest, I noticed pretty bad performance when the guest was booted
with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220
(RHEL6.2) kernel.
For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run the older guest
kernel was nearly 12x better!

How many CPUs does your host have?

80 cores on the DL980 (i.e. 8 Westmere sockets).

I was using numactl to bind the qemu of the 40-way guest to NUMA nodes
4-7 (or, for a 60-way guest,
binding it to nodes 2-7).

/etc/qemu-ifup tap0

numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 
/usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu 
Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme 
-enable-kvm \

-m 65536 -smp 40 \
-name vm1 -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait 
\
-drive 
file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none 
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,macaddr=..mac_addr.. \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

/etc/qemu-ifdown tap0


I knew that there would be a few additional temporary qemu worker threads
created, i.e. there would be some
oversubscription.


Will have to retry by doing some explicit pinning of the vcpus to native 
cores (without using virsh).



Turned on function tracing and found that there appears to be more time being
spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
guest.

Looks like you may be running into the ticket spinlock
code. During the early RHEL 6 days, Gleb came up with a
patch to automatically disable ticket spinlocks when
running inside a KVM guest.

IIRC that patch got rejected upstream at the time,
with upstream developers preferring to wait for a
better solution.

If such a better solution is not on its way upstream
now (two years later), maybe we should just merge
Gleb's patch upstream for the time being?

I think the pv spinlock that is actively being discussed currently should
address the issue, but I am not sure anyone has tested it against a non-ticket
lock in a guest to see which one performs better.


I did see that discussion...seems to have originated from the Xen context.

Vinod



--
Gleb.





Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-15 Thread Chegu Vinod
Rik van Riel riel at redhat.com writes:

 
 On 04/11/2012 01:21 PM, Chegu Vinod wrote:
 
  Hello,
 
  While running AIM7 (workfile.high_systime) in a single 40-way (or a single
  60-way) KVM guest, I noticed pretty bad performance when the guest was booted
  with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220
  (RHEL6.2) kernel.
 
  For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x
  better than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run
  the older guest kernel was nearly 12x better!
 
  Turned on function tracing and found that there appears to be more time
  being spent around the lock code in the 3.3.1 guest when compared to the
  2.6.32-220 guest.
 
 Looks like you may be running into the ticket spinlock
 code. During the early RHEL 6 days, Gleb came up with a
 patch to automatically disable ticket spinlocks when
 running inside a KVM guest.
 

Thanks for the pointer. 
Perhaps that is the issue.  
I did look up that old discussion thread.


 IIRC that patch got rejected upstream at the time,
 with upstream developers preferring to wait for a
 better solution.
 
 If such a better solution is not on its way upstream
 now (two years later), maybe we should just merge
 Gleb's patch upstream for the time being?



Also noticed a recent discussion thread (that originated from the Xen context)

http://article.gmane.org/gmane.linux.kernel.virtualization/15078

Not yet sure if this recent discussion is also in some way related to
the older one initiated by Gleb.

Thanks
Vinod





Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-12 Thread Rik van Riel

On 04/11/2012 01:21 PM, Chegu Vinod wrote:


Hello,

While running AIM7 (workfile.high_systime) in a single 40-way (or a single
60-way) KVM guest, I noticed pretty bad performance when the guest was booted
with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220
(RHEL6.2) kernel.

For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run the older guest
kernel was nearly 12x better!



Turned on function tracing and found that there appears to be more time being
spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
guest.


Looks like you may be running into the ticket spinlock
code. During the early RHEL 6 days, Gleb came up with a
patch to automatically disable ticket spinlocks when
running inside a KVM guest.

IIRC that patch got rejected upstream at the time,
with upstream developers preferring to wait for a
better solution.

If such a better solution is not on its way upstream
now (two years later), maybe we should just merge
Gleb's patch upstream for the time being?


Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-11 Thread Chegu Vinod

Hello,

While running AIM7 (workfile.high_systime) in a single 40-way (or a single
60-way) KVM guest, I noticed pretty bad performance when the guest was booted
with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220
(RHEL6.2) kernel.

I am still trying to dig more into the details here. Wondering if some changes in
the upstream kernel (i.e. since 2.6.32-220) might be causing this to show up in
a guest environment (esp. for this system-intensive workload).

Has anyone else observed this kind of behavior? Is it a known issue with a fix
in the pipeline? If not, are there any special knobs/tunables that one needs to
explicitly set/clear etc. when using newer kernels like 3.3.1 in a guest?

I have included some info. below. 

Also, any pointers on what else I could capture would be helpful.

Thanks!
Vinod

---

Platform used:
DL980 G7 (80 cores + 128G RAM).  Hyper-threading is turned off.

Workload used:
AIM7 (workfile.high_systime), using RAM disks. This is
primarily a CPU-intensive workload... not much I/O.

Software used:
qemu-system-x86_64   :  1.0.50 (i.e. latest as of about a week or so ago).
Native/Host  OS  :  3.3.1 (SLUB allocator explicitly enabled)
Guest-RunA   OS  :  2.6.32-220 (i.e. RHEL6.2 kernel)
Guest-RunB   OS  :  3.3.1

Guest was pinned on:
numa nodes 4,5,6,7       -  40 VCPUs + 64G   (i.e. 40-way guest)
numa nodes 2,3,4,5,6,7   -  60 VCPUs + 96G   (i.e. 60-way guest)
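
(A quick way to sanity-check this kind of pinning from the host, as a sketch;
the assumption that only one qemu-system-x86_64 instance is running is mine:)

numactl --hardware     # shows which host cpus and how much memory belong to each node
grep Cpus_allowed_list /proc/$(pidof qemu-system-x86_64)/status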

For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x better
than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run the older
guest kernel was nearly 12x better!

For the Guest-RunB (3.3.1) case I ran "mpstat -P ALL 1" on the host and
observed that a very high % of time was being spent by the CPUs outside
guest mode and mostly in the host (i.e. sys). Looking at the perf related
traces, it seemed like there were long pauses in the guest, perhaps waiting
for the zone->lru_lock as part of release_pages(), and this caused the VT
PLE related code to kick in on the host.
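
(One way to check the PLE theory from the host side, as a sketch; it assumes the
kvm tracepoints are available on the host kernel, and the 10 second window is
arbitrary:)

perf stat -e 'kvm:kvm_exit' -a sleep 10     # count all guest exits host-wide

perf record -a -e 'kvm:kvm_exit' sleep 10   # record exits along with their exit reasons;
perf script | less                          # PAUSE-induced (PLE) exits show up with their
                                            # own exit reason in the trace output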

Turned on function tracing and found that there appears to be more time being
spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
guest.  Here is a small sampling of these traces... Notice the time stamp jump
around _spin_lock_irqsave <-release_pages in the case of Guest-RunB.


1) 40-way Guest-RunA (2.6.32-220 kernel):
-


#   TASK-PID   CPU#  TIMESTAMP  FUNCTION

   <...>-32147 [020] 145783.127452: native_flush_tlb <-flush_tlb_mm
   <...>-32147 [020] 145783.127452: free_pages_and_swap_cache <-unmap_region
   <...>-32147 [020] 145783.127452: lru_add_drain <-free_pages_and_swap_cache
   <...>-32147 [020] 145783.127452: release_pages <-free_pages_and_swap_cache
   <...>-32147 [020] 145783.127452: _spin_lock_irqsave <-release_pages
   <...>-32147 [020] 145783.127452: __mod_zone_page_state <-release_pages
   <...>-32147 [020] 145783.127452: mem_cgroup_del_lru_list <-release_pages

...

   <...>-32147 [022] 145783.133536: release_pages <-free_pages_and_swap_cache
   <...>-32147 [022] 145783.133536: _spin_lock_irqsave <-release_pages
   <...>-32147 [022] 145783.133536: __mod_zone_page_state <-release_pages
   <...>-32147 [022] 145783.133536: mem_cgroup_del_lru_list <-release_pages
   <...>-32147 [022] 145783.133537: lookup_page_cgroup <-mem_cgroup_del_lru_list




2) 40-way Guest-RunB (3.3.1):
-


#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
   <...>-16459 [009]  101757.383125: free_pages_and_swap_cache <-tlb_flush_mmu
   <...>-16459 [009]  101757.383125: lru_add_drain <-free_pages_and_swap_cache
   <...>-16459 [009]  101757.383125: release_pages <-free_pages_and_swap_cache
   <...>-16459 [009]  101757.383125: _raw_spin_lock_irqsave <-release_pages
   <...>-16459 [009] d... 101757.384861: mem_cgroup_lru_del_list <-release_pages
   <...>-16459 [009] d... 101757.384861: lookup_page_cgroup <-mem_cgroup_lru_del_list


   <...>-16459 [009] .N.. 101757.390385: release_pages <-free_pages_and_swap_cache
   <...>-16459 [009] .N.. 101757.390385: _raw_spin_lock_irqsave <-release_pages
   <...>-16459 [009] dN.. 101757.392983: mem_cgroup_lru_del_list <-release_pages
   <...>-16459 [009] dN.. 101757.392983: lookup_page_cgroup <-mem_cgroup_lru_del_list
   <...>-16459 [009] dN.. 101757.392983: __mod_zone_page_state <-release_pages
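
(For reference, the kind of function-tracer setup that produces listings like
the ones above, as a sketch; it assumes debugfs is mounted at /sys/kernel/debug
inside the guest, and the filter patterns are illustrative rather than the exact
ones used for these runs:)

cd /sys/kernel/debug/tracing
echo function > current_tracer             # plain function tracer ("func <-caller" lines)
echo 'release_pages *spin_lock_irqsave*' > set_ftrace_filter   # optional; an empty filter traces everything
echo 1 > tracing_on
# ... run the AIM7 workload for a short while ...
echo 0 > tracing_on
cat trace > /tmp/guest-ftrace.txt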



