[PATCH 1/2] mark hyper-v hypercall page as dirty

2014-01-22 Thread Vadim Rozenfeld
Signed-off-by: Vadim Rozenfeld 
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59b95b1..e8599ed 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1840,6 +1840,7 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (__copy_to_user((void __user *)addr, instructions, 4))
return 1;
kvm->arch.hv_hypercall = data;
+   mark_page_dirty(kvm, gfn);
break;
}
case HV_X64_MSR_REFERENCE_TSC: {
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] mark hyper-v vapic assist page as dirty

2014-01-22 Thread Vadim Rozenfeld
Signed-off-by: Vadim Rozenfeld 
---
 arch/x86/kvm/x86.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e8599ed..cd8a41f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1869,19 +1869,21 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
switch (msr) {
case HV_X64_MSR_APIC_ASSIST_PAGE: {
+   u64 gfn;
unsigned long addr;
 
if (!(data & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE)) {
vcpu->arch.hv_vapic = data;
break;
}
-   addr = gfn_to_hva(vcpu->kvm, data >>
- HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT);
+   gfn = data >> HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT;
+   addr = gfn_to_hva(vcpu->kvm, gfn);
if (kvm_is_error_hva(addr))
return 1;
if (__clear_user((void __user *)addr, PAGE_SIZE))
return 1;
vcpu->arch.hv_vapic = data;
+   mark_page_dirty(vcpu->kvm, gfn);
break;
}
case HV_X64_MSR_EOI:
-- 
1.8.1.4



[PATCH 0/2] Mark Hyper-V hypercall and vapic assist pages as dirty

2014-01-22 Thread Vadim Rozenfeld
Vadim Rozenfeld (2):
  mark hyper-v hypercall page as dirty
  mark hyper-v vapic assist page as dirty

 arch/x86/kvm/x86.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

-- 
1.8.1.4



Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-22 Thread Victor Kamensky
Hi Alex,

Sorry for the delayed reply; I was focusing on the discussion
with Peter. Hope you and other folks get something
out of it :).

Please see responses inline

On 22 January 2014 02:52, Alexander Graf  wrote:
>
> On 22.01.2014, at 08:26, Victor Kamensky  wrote:
>
>> On 21 January 2014 22:41, Alexander Graf  wrote:
>>>
>>>
>>> "Native endian" really is just a shortcut for "target endian"
>>> which is LE for ARM and BE for PPC. There shouldn't be
>>> a qemu-system-armeb or qemu-system-ppc64le.
>>
>> I disagree. Fully functional ARM BE system is what we've
>> been working on for last few months. 'We' is Linaro
>> Networking Group, Endian subteam and some other guys
>> in ARM and across community. Why we do that is a bit
>> beyond of this discussion.
>>
>> ARM BE patches for both V7 and V8 are already in mainline
>> kernel. But ARM BE KVM host is broken now. It is known
>> deficiency that I am trying to fix. Please look at [1]. Patches
>> for V7 BE KVM were proposed and currently under active
>> discussion. Currently I work on ARM V8 BE KVM changes.
>>
>> So "native endian" in ARM is value of CPSR register E bit.
>> If it is off native endian is LE, if it is on it is BE.
>>
>> Once and if we agree on ARM BE KVM host changes, the
>> next step would be patches in qemu one of which introduces
>> qemu-system-armeb. Please see [2].
>
> I think we're facing an ideology conflict here. Yes, there
> should be a qemu-system-arm that is BE capable.

Maybe it is not an ideology conflict but rather a terminology
clarity issue :). I am not sure what you mean by "qemu-system-arm
that is BE capable". In the qemu build system there is just the target
name 'arm', which is an ARM V7 cpu in LE mode, and the 'armeb'
target, which is an ARM V7 cpu in BE mode. That is true for a lot
of open source packages. You could check [1], the patch that
introduces the armeb target into qemu. A build for the
arm target produces the qemu-system-arm executable, which is
marked 'ELF 32-bit LSB executable' and can run on
traditional LE ARM Linux. A build for the armeb target produces
the qemu-system-armeb executable, marked 'ELF 32-bit
MSB executable', which can run on BE ARM Linux. armeb is
nothing special here, just a build option for qemu so that it runs
on BE ARM Linux.

Both qemu-system-arm and qemu-system-armeb should
be BE/LE capable. I.e. either of them, along with KVM, could
run either an LE or a BE guest. MarcZ demonstrated that this
is possible. I've tested both LE and BE guests with
qemu-system-arm running on traditional LE ARM Linux,
effectively repeating Marc's setup but with qemu.
And with my patches I tested both BE and LE guests with
qemu-system-armeb running on BE ARM Linux.

> There
> should also be a qemu-system-ppc64 that is LE capable.
> But there is no point in changing the "default endiannes"
> for the virtual CPUs that we plug in there. Both CPUs are
> perfectly capable of running in LE or BE mode, the
> question is just what we declare the "default".

I am not sure what you mean by "default". Is it the initial
setting of the CPSR E bit and the 'cp15 c1, c0, 0' EE bit? Yes,
the way it is currently implemented by the committed
qemu-system-arm, and by the proposed qemu-system-armeb
patches, they are both off. I.e. even qemu-system-armeb
starts running the vcpu in LE mode, for a reason very similar
to the one described in your next paragraph:
qemu-system-armeb has a tiny bootloader that starts
in LE mode and jumps to the kernel; the kernel switches the cpu to
BE mode with 'setend be', and the EE bit is set just
before the mmu is enabled.

> Think about the PPC bootstrap. We start off with a
> BE firmware, then boot into the Linux kernel which
> calls a hypercall to set the LE bit on every interrupt.

We have a very similar situation with BE ARM Linux.
When we run ARM BE Linux we start with a bootloader
which is LE, and then the CPU issues 'setend be' as
soon as it starts executing kernel code; all secondary
CPUs issue 'setend be' when they come out of the reset pen
or bootmonitor sleep.

> But there's no reason this little endian kernel
> couldn't theoretically have big endian user space running
> with access to emulated device registers.

I don't want to go there, it is very very messy ...

-- Just a side note: --
Interestingly, half a year before I joined Linaro, at Cisco a
colleague and I implemented a kernel patch that allowed running
BE user-space processes as a sort of separate personality on
top of an LE ARM kernel ... treated as a kind of multi-ABI system.
Effectively we had to do byteswaps on all non-trivial
system calls and ioctls inside the kernel. We converted
around 30 system calls and around 10 ioctls. Our target process
was using just those and it worked, but the patch was
very intrusive and unnatural. I think in Linaro there was
some public version of my presentation circulated that
explained all this mess. I don't want to seriously consider it.

The only robust mixed mode, as MarcZ demonstrated,
can be done only on VM boundaries. I.e. an LE host can
run a BE guest fine, and a BE host can run an LE guest fine.
Everything else would

Re: kvm virtio ethernet ring on guest side over high throughput (packet per second)

2014-01-22 Thread Jason Wang
On 01/23/2014 05:32 AM, Alejandro Comisario wrote:
> Thank you so much Stefan for the help and cc'ing Michael & Jason.
> Like you advised yesterday on IRC, today we are making some tests with
> the application setting TCP_NODELAY in the socket options.
>
> So we will try that and get back to you with further information.
> In the mean time, maybe showing what options the vms are using while running !
>
> # 
> --
> /usr/bin/kvm -S -M pc-1.0 -cpu
> core2duo,+lahf_lm,+rdtscp,+pdpe1gb,+aes,+popcnt,+x2apic,+sse4.2,+sse4.1,+dca,+xtpr,+cx16,+tm2,+est,+vmx,+ds_cpl,+pbe,+tm,+ht,+ss,+acpi,+ds
> -enable-kvm -m 32768 -smp 8,sockets=1,cores=6,threads=2 -name
> instance-0254 -uuid d25b1b20-409e-4d7f-bd92-2ef4073c7c2b
> -nodefconfig -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0254.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
> -no-shutdown -kernel /var/lib/nova/instances/instance-0254/kernel
> -initrd /var/lib/nova/instances/instance-0254/ramdisk -append
> root=/dev/vda console=ttyS0 -drive
> file=/var/lib/nova/instances/instance-0254/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=writethrough
> -device 
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> -netdev tap,fd=19,id=hostnet0 -device

Better enable vhost as Stefan suggested. It may help a lot here.
> virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:27:d4:6d,bus=pci.0,addr=0x3
> -chardev 
> file,id=charserial0,path=/var/lib/nova/instances/instance-0254/console.log
> -device isa-serial,chardev=charserial0,id=serial0 -chardev
> pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
> -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:4 -k en-us -vga cirrus
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
> # 
> --
>
> best regards
>
>
> Alejandro Comisario
> #melicloud CloudBuilders
> Arias 3751, Piso 7 (C1430CRG)
> Ciudad de Buenos Aires - Argentina
> Cel: +549(11) 15-3770-1857
> Tel : +54(11) 4640-8443
>
>
> On Wed, Jan 22, 2014 at 12:22 PM, Stefan Hajnoczi  wrote:
>> On Tue, Jan 21, 2014 at 04:06:05PM -0200, Alejandro Comisario wrote:
>>
>> CCed Michael Tsirkin and Jason Wang who work on KVM networking.
>>
>>> Hi guys, in the past when using physical servers we had several
>>> throughput issues with our APIs; in our case we
>>> measure this in packets per second, since we don't have that much
>>> bandwidth (Mb/s) because our APIs respond with lots of very small
>>> packets (maximum response of 3.5k and avg response of 1.5k). When we
>>> were using those physical servers and reached throughput capacity
>>> (detected via client timeouts), we tuned the ethernet ring configuration
>>> and made the problem disappear.
>>>
>>> Today with KVM and over 10k virtual instances, when we want to
>>> increase the throughput of KVM instances, we bumped into the fact that
>>> when using virtio on guests, the maximum ring configuration
>>> is 256 TX/RX, and on the host side the attached vnet has a txqueuelen
>>> of 500.
>>>
>>> What I want to know is: how can I tune the guest to support more
>>> packets per second if I know that's my bottleneck?
>> I suggest investigating performance in a systematic way.  Set up a
>> benchmark that saturates the network.  Post the details of the benchmark
>> and the results that you are seeing.
>>
>> Then, we can discuss how to investigate the root cause of the bottleneck.
>>
>>> * does virtio expose more packets to configure in the virtual ethernet's
>>> ring?
>> No, ring size is hardcoded in QEMU (on the host).
>>
>>> * does the use of vhost_net help me with increasing packets per
>>> second and not only bandwidth?
>> vhost_net is generally the most performant network option.
>>
>>> has anyone struggled with this before who knows where I can look?
>>> there's LOOOTS of information about networking performance
>>> tuning of KVM, but nothing related to increasing throughput in pps
>>> capacity.
>>>
>>> This is a couple of configurations that we are using right now on the
>>> compute nodes:
>>>
>>> * 2x1Gb bonded interfaces (more than 20 models in use; just ask for the list)
>>> * Multi-queue interfaces, pinned via IRQ to different cores
>>> * Linux bridges,  no VLAN, no open-vswitch
>>> * ubuntu 12.04 kernel 3.2.0-[40-48]

Re: kvm virtio ethernet ring on guest side over high throughput (packet per second)

2014-01-22 Thread Jason Wang
On 01/22/2014 11:22 PM, Stefan Hajnoczi wrote:
> On Tue, Jan 21, 2014 at 04:06:05PM -0200, Alejandro Comisario wrote:
>
> CCed Michael Tsirkin and Jason Wang who work on KVM networking.
>
>> Hi guys, in the past when using physical servers we had several
>> throughput issues with our APIs; in our case we
>> measure this in packets per second, since we don't have that much
>> bandwidth (Mb/s) because our APIs respond with lots of very small
>> packets (maximum response of 3.5k and avg response of 1.5k). When we
>> were using those physical servers and reached throughput capacity
>> (detected via client timeouts), we tuned the ethernet ring configuration
>> and made the problem disappear.
>>
>> Today with KVM and over 10k virtual instances, when we want to
>> increase the throughput of KVM instances, we bumped into the fact that
>> when using virtio on guests, the maximum ring configuration
>> is 256 TX/RX, and on the host side the attached vnet has a txqueuelen
>> of 500.
>>
>> What I want to know is: how can I tune the guest to support more
>> packets per second if I know that's my bottleneck?
> I suggest investigating performance in a systematic way.  Set up a
> benchmark that saturates the network.  Post the details of the benchmark
> and the results that you are seeing.
>
> Then, we can discuss how to investigate the root cause of the bottleneck.
>
>> * does virtio expose more packets to configure in the virtual ethernet's
>> ring?
> No, ring size is hardcoded in QEMU (on the host).

Does it make sense to let the user configure it, for example through a
QEMU command-line option?
>
>> * does the use of vhost_net help me with increasing packets per
>> second and not only bandwidth?
> vhost_net is generally the most performant network option.
>
>> has anyone struggled with this before who knows where I can look?
>> there's LOOOTS of information about networking performance
>> tuning of KVM, but nothing related to increasing throughput in pps
>> capacity.
>>
>> This is a couple of configurations that we are using right now on the
>> compute nodes:
>>
>> * 2x1Gb bonded interfaces (more than 20 models in use; just ask for the list)
>> * Multi-queue interfaces, pinned via IRQ to different cores

Maybe you can try multiqueue virtio-net with vhost. It lets the guest
use more than one tx/rx virtqueue pair for network processing.
>> * Linux bridges,  no VLAN, no open-vswitch
>> * ubuntu 12.04 kernel 3.2.0-[40-48]
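For reference, multiqueue virtio-net with vhost is enabled on the QEMU command line roughly like this (a sketch only: the queue count and vector math are illustrative, and flag spellings should be checked against your QEMU version):

```shell
# Host side: tap backend with vhost and 4 queue pairs; the device needs
# mq=on and enough MSI-X vectors (commonly 2*queues + 2).
qemu-system-x86_64 ... \
  -netdev tap,id=hostnet0,vhost=on,queues=4 \
  -device virtio-net-pci,netdev=hostnet0,mq=on,vectors=10,mac=fa:16:3e:27:d4:6d

# Guest side: activate the extra queue pairs.
ethtool -L eth0 combined 4
```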



Re: [BUG] sched: tip/master show soft lockup while running multiple VM

2014-01-22 Thread Michael wang
On 01/22/2014 08:36 PM, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 04:27:45PM +0800, Michael wang wrote:
>> # CONFIG_PREEMPT_NONE is not set
>> CONFIG_PREEMPT_VOLUNTARY=y
>> # CONFIG_PREEMPT is not set
> 
> Could you try the patch here:
> 
>   lkml.kernel.org/r/20140122102435.gh31...@twins.programming.kicks-ass.net
> 
> I suspect its the same issue.

Yup, it works.

Regards,
Michael Wang




Re: [PATCH RFC 00/73] tree-wide: clean up some no longer required #include

2014-01-22 Thread Paul Gortmaker
[Re: [PATCH RFC 00/73] tree-wide: clean up some no longer required #include ] On 22/01/2014 (Wed 18:00) Stephen Rothwell wrote:

> Hi Paul,
> 
> On Tue, 21 Jan 2014 16:22:03 -0500 Paul Gortmaker wrote:
> >
> > Where: This work exists as a queue of patches that I apply to
> > linux-next; since the changes are fixing some things that currently
> > can only be found there.  The patch series can be found at:
> > 
> >http://git.kernel.org/cgit/linux/kernel/git/paulg/init.git
> >git://git.kernel.org/pub/scm/linux/kernel/git/paulg/init.git
> > 
> > I've avoided annoying Stephen with another queue of patches for
> > linux-next while the development content was in flux, but now that
> > the merge window has opened, and new additions are fewer, perhaps he
> > wouldn't mind tacking it on the end...  Stephen?
> 
> OK, I have added this to the end of linux-next today - we will see how we
> go.  It is called "init".

Thanks, it was a great help as it uncovered a few issues on fringe arches
that I didn't have toolchains for, and I've fixed all of those up.

I've noticed that powerpc has been un-buildable for a while now; I have
used this hack patch locally so I could run the ppc defconfigs to check
that I didn't break anything.  Maybe useful for linux-next in the
interim?  It is a hack patch -- Not-Signed-off-by: Paul Gortmaker.  :)

Paul.
--

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index d27960c89a71..d0f070a2b395 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -560,9 +560,9 @@ extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);
 
 #define pmd_move_must_withdraw pmd_move_must_withdraw
-typedef struct spinlock spinlock_t;
-static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
-spinlock_t *old_pmd_ptl)
+struct spinlock;
+static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
+struct spinlock *old_pmd_ptl)
 {
/*
 * Archs like ppc64 use pgtable to store per pmd



Re: KVM and variable-endianness guest CPUs

2014-01-22 Thread Victor Kamensky
Peter, could I please ask you a favor: could you please
stop deleting pieces of your and my previous responses
when you reply, and just reply inline.
Sometimes I would like to reference my or your previous
statement, but I cannot find it in your response email.
It is very bizarre. Sorry, it will make your response email
bigger, but I am very confused otherwise.

On 22 January 2014 15:18, Peter Maydell  wrote:
> On 22 January 2014 22:47, Victor Kamensky  wrote:
>> You deleted my example, but I need it again:
>> Consider the following ARM code snippets:
>>
>> setend le
>> mov r1, #0x04030201
>> str r1, [r0]
>>
>> and
>>
>> setend be
>> mov r1, #0x01020304
>> str r1, [r0]
>>
>> Just for LE host case basically you are saying that if guest
>> issues 4 bytes store
>> instruction for CPU core register and CPSR E bit is off,
>> mmio.data[0] would contain LSB of integer from this CPU core
>> register. I don't understand your bus endianity thing, but I do
>> understand LSB of integer in core CPU register. Do we agree
>> that in above example in second case when BE access is
>> on (E bit is on), it is the exactly the same memory transaction
>> but it data[0] = 0x1 is MSB of integer in CPU core register (still the
>> same LE host case)?
>
> Yes, this is true both if we define mmio.data[] as "always
> little endian" and if we define it as "host kernel endianness",
> since you've specified an LE host here.

OK, we are in agreement here. mmio.data[] = {0x1, 0x2, 0x3, 0x4}
for both types of guest access when the host is LE. And as
far as the bus is concerned it is absolutely the same transaction.

> (The kernel has to byte swap if CPSR.E is set, because
> it has to emulate the byte-lane-swap the CPU hardware
> does internally before register data goes out to the bus.)
>
>> Consider above big endian case (setend be) example,
>> but now running in BE KVM host. 0x4 is LSB of CPU
>> core register in this case.
>
> Yes. In this case if we are using the "mmio.data is host
> kernel endianness" definition then mmio.data[0] should be
> 0x01 (the MSB of the 32 bit data value).

If mmio.data[0] is 0x1, then mmio.data[] = {0x1, 0x2, 0x3, 0x4},
now with the KVM host and emulator running in BE mode.
But that contradicts what you said before in a previous
email (please see [1]). Here is what you and I said a few
paragraphs before my "Consider above big endian ..."
paragraph:

- start quote --

>> BTW could you please propose how will you see such
>> "32 bit transaction, value 0x04030201, address $whatever".
>> on ARM LE CPU in mmio.data?
>
> That is exactly the problem we're discussing in this thread.
> Indeed, I proposed an answer to it, which is that the mmio.data
> array should be in host kernel byte order, in which case it
> would be (for an LE host kernel) 0x01 in mmio.data[0] and so
> on up.
>
>> If it would be {0x01, 0x02, 0x03, 0x4} it is fine with
>> me. That is current case ARM LE case when above
>> snippets would be executed by guest.
>>
>> Would we  agree that the same arrangement would be
>> true for all other cases on ARM regardless of all other
>> endianities of qemu, KVM host, guest, hypervisor, etc?
>
> No; under my proposal, for a big-endian host kernel (and
> thus big-endian QEMU) the order would be
> mmio.data[0] = 0x04, etc. (This wouldn't change based
> on the guest kernel endianness or whether it happened
> to have set the E bit temporarily.)

- end quote --

So for the same memory transaction (the ARM 'setend be'
snippet) executed under a BE ARM KVM host, in one case you
said that "mmio.data[0] should be 0x01 (the MSB of the 32 bit
data value)", and before that you said "No; under my proposal,
for a big-endian host kernel (and thus big-endian QEMU) the
order would be mmio.data[0] = 0x04, etc". So which is mmio.data[0]?

I argue that for all three code snippets in this email (two for
ARM and one for PPC) mmio.data[] = {0x1, 0x2, 0x3, 0x4},
and that does not depend on whether it is an LE ARM KVM host,
a BE ARM KVM host, or a BE PPC KVM host.

> (Notice that the
> BE host kernel can actually just behave exactly like the LE
> one: byteswap 32 bit value from guest register if guest
> CPSR.E is set, then do a 32-bit store of the 32 bit word
> into mmio.data[].)
>
>>> Defining that mmio.data[] is always little-endian would
>>> be a valid definition of an API if we were doing it from
>>> scratch. It has the unfortunate property that it would
>>> completely break the existing PPC BE setups, which
>>> don't define it that way, so it is a non-starter.
>>
>> I believe, but I need to check, that PPC BE setup actually
>> acts as the second case in above example  If we have PPC
>> BE guest executing the following instructions:
>>
>> lis r1,0x102
>> ori r1,r1,0x304
>> stw r1,0(r0)
>>
>> after first two instructions r1 would contain 0x01020304.
>> IMHO It exactly corresponds to above my ARM second case -
>> BE guest when it runs under ARM BE KVM host. I believe
>> 

Re: KVM and variable-endianness guest CPUs

2014-01-22 Thread Peter Maydell
On 22 January 2014 22:47, Victor Kamensky  wrote:
> You deleted my example, but I need it again:
> Consider the following ARM code snippets:
>
> setend le
> mov r1, #0x04030201
> str r1, [r0]
>
> and
>
> setend be
> mov r1, #0x01020304
> str r1, [r0]
>
> Just for LE host case basically you are saying that if guest
> issues 4 bytes store
> instruction for CPU core register and CPSR E bit is off,
> mmio.data[0] would contain LSB of integer from this CPU core
> register. I don't understand your bus endianity thing, but I do
> understand LSB of integer in core CPU register. Do we agree
> that in above example in second case when BE access is
> on (E bit is on), it is the exactly the same memory transaction
> but it data[0] = 0x1 is MSB of integer in CPU core register (still the
> same LE host case)?

Yes, this is true both if we define mmio.data[] as "always
little endian" and if we define it as "host kernel endianness",
since you've specified an LE host here.

(The kernel has to byte swap if CPSR.E is set, because
it has to emulate the byte-lane-swap the CPU hardware
does internally before register data goes out to the bus.)

> Consider above big endian case (setend be) example,
> but now running in BE KVM host. 0x4 is LSB of CPU
> core register in this case.

Yes. In this case if we are using the "mmio.data is host
kernel endianness" definition then mmio.data[0] should be
0x01 (the MSB of the 32 bit data value). (Notice that the
BE host kernel can actually just behave exactly like the LE
one: byteswap 32 bit value from guest register if guest
CPSR.E is set, then do a 32-bit store of the 32 bit word
into mmio.data[].)

>> Defining that mmio.data[] is always little-endian would
>> be a valid definition of an API if we were doing it from
>> scratch. It has the unfortunate property that it would
>> completely break the existing PPC BE setups, which
>> don't define it that way, so it is a non-starter.
>
> I believe, but I need to check, that PPC BE setup actually
> acts as the second case in above example  If we have PPC
> BE guest executing the following instructions:
>
> lis r1,0x102
> ori r1,r1,0x304
> stw r1,0(r0)
>
> after first two instructions r1 would contain 0x01020304.
> IMHO It exactly corresponds to above my ARM second case -
> BE guest when it runs under ARM BE KVM host. I believe
> that mmio.data[] in PPC BE case would be {0x1, 0x2, 0x3, 0x4}.

Yes, assuming a BE PPC host kernel (which is the usual
arrangement).

> But according to you data[0] must be 0x4 in BE host case

Er, no. The data here is 0x01020304, so for a BE host
data[0] is the big end, ie 0x1. It would only be 0x4 if
mmio.data[] were LE always (or if you were running
your BE PPC guest on an LE PPC host, which I don't
think is supported currently).

thanks
-- PMM


Re: KVM and variable-endianness guest CPUs

2014-01-22 Thread Victor Kamensky
On 22 January 2014 12:02, Peter Maydell  wrote:
> On 22 January 2014 19:29, Victor Kamensky  wrote:
>> On 22 January 2014 09:29, Peter Maydell  wrote:
>>> This just isn't how real buses work. There is no
>>> "address + 1, address + 2". There is a single address
>>> for the memory transaction and a set of data on
>>> data lines and some separate size information.
>>
>> Yes, and those data lines are just binary signal lines
>> not numbers. If one would want to describe information
>> on data lines as number he/she needs to assign integer
>> bits numbers to lines, and that is absolutely arbitrary
>> process
>
> It is part of the definition of the bus which signal pin is
> D0 and which is D31...
>
>>  In one choose one way to assigned those
>> bits to lines and another choose reverse way, they will
>> talk about completely different numbers for the same
>> signals on the bus. Such data lines enumeration has
>> no reflection on how bus actually works. And I don't
>> even see why it should be described just as single
>> integer, for example one can describe information on
>> data lines as set of 4 byte value, nothing wrong with
>> such description.
>
> It is not how the hardware works. If you describe it as
> a set of 4 bytes, then you need to also say how you are
> mapping from those 4 bytes to the actual 32 bit data
> transaction the hardware is doing. Which is the question
> we're trying to answer in this thread.
>
> I've snipped a huge chunk of my initial reply to this email,
> because it all boiled down to "sorry, you're just not correct
> about how the hardware works" and it doesn't seem
> necessary to repeat it three times. Devices really do see
> "this is a transaction with this value and this size". They
> do not in any way see a 32 bit word write as "this is a collection
> of byte writes". Therefore:
>
>  1) thinking about a 32 bit word write in terms of a byte array
> is confusing
>  2) since the KVM API is unfortunately stuck with this byte
>array, we must define the semantics of what it actually
>contains, so that the kernel and QEMU can go between
>   "the value being read/written in the transaction" and
>   "the contents of the byte array
>
>>> As soon as you try to think of the mmio.data as a set
>>> of bytes then you have to specify some endianness to
>>> the data, so that both sides (kernel and userspace)
>>> know how to reconstruct the actual data value from the
>>> array of bytes.
>>
>> What actual value? In what sense? You need to bring
>> into discussion semantic of this h/w address to really tell
>> that.
>
> I've just spent the last two emails doing exactly that.
> The actual value, as in "this CPU just did a memory
> transaction of a 32 bit data value".

You deleted my example, but I need it again:
Consider the following ARM code snippets:

setend le
mov r1, #0x04030201
str r1, [r0]

and

setend be
mov r1, #0x01020304
str r1, [r0]

Just for LE host case basically you are saying that if guest
issues 4 bytes store
instruction for CPU core register and CPSR E bit is off,
mmio.data[0] would contain LSB of integer from this CPU core
register. I don't understand your bus endianity thing, but I do
understand LSB of integer in core CPU register. Do we agree
that in above example in second case when BE access is
on (the E bit is on), it is exactly the same memory transaction
but its data[0] = 0x1 is the MSB of the integer in the CPU core
register (still the same LE host case)?

>> BTW could you please propose how will you see such
>> "32 bit transaction, value 0x04030201, address $whatever".
>> on ARM LE CPU in mmio.data?
>
> That is exactly the problem we're discussing in this thread.
> Indeed, I proposed an answer to it, which is that the mmio.data
> array should be in host kernel byte order, in which case it
> would be (for an LE host kernel) 0x01 in mmio.data[0] and so
> on up.
>
>> If it would be {0x01, 0x02, 0x03, 0x4} it is fine with
>> me. That is current case ARM LE case when above
>> snippets would be executed by guest.
>>
>> Would we  agree that the same arrangement would be
>> true for all other cases on ARM regardless of all other
>> endianities of qemu, KVM host, guest, hypervisor, etc?
>
> No; under my proposal, for a big-endian host kernel (and
> thus big-endian QEMU) the order would be
> mmio.data[0] = 0x04, etc. (This wouldn't change based
> on the guest kernel endianness or whether it happened
> to have set the E bit temporarily.)

Consider above big endian case (setend be) example,
but now running in BE KVM host. 0x4 is LSB of CPU
core register in this case.

> Defining that mmio.data[] is always little-endian would
> be a valid definition of an API if we were doing it from
> scratch. It has the unfortunate property that it would
> completely break the existing PPC BE setups, which
> don't define it that way, so it is a non-starter.

I believe, but I need to check, that PPC BE setup actually
acts as the second case in above example  If we have PPC
BE guest executing the followi

Re: kvm virtio ethernet ring on guest side over high throughput (packet per second)

2014-01-22 Thread Alejandro Comisario
Thank you so much Stefan for the help and cc'ing Michael & Jason.
Like you advised yesterday on IRC, today we are making some tests with
the application setting TCP_NODELAY in the socket options.

So we will try that and get back to you with further information.
In the meantime, here are the options the VMs are using while running:

# 
--
/usr/bin/kvm -S -M pc-1.0 -cpu
core2duo,+lahf_lm,+rdtscp,+pdpe1gb,+aes,+popcnt,+x2apic,+sse4.2,+sse4.1,+dca,+xtpr,+cx16,+tm2,+est,+vmx,+ds_cpl,+pbe,+tm,+ht,+ss,+acpi,+ds
-enable-kvm -m 32768 -smp 8,sockets=1,cores=6,threads=2 -name
instance-0254 -uuid d25b1b20-409e-4d7f-bd92-2ef4073c7c2b
-nodefconfig -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0254.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -kernel /var/lib/nova/instances/instance-0254/kernel
-initrd /var/lib/nova/instances/instance-0254/ramdisk -append
root=/dev/vda console=ttyS0 -drive
file=/var/lib/nova/instances/instance-0254/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=writethrough
-device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-netdev tap,fd=19,id=hostnet0 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:27:d4:6d,bus=pci.0,addr=0x3
-chardev 
file,id=charserial0,path=/var/lib/nova/instances/instance-0254/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-usb -device usb-tablet,id=input0 -vnc 0.0.0.0:4 -k en-us -vga cirrus
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
# 
--

best regards


Alejandro Comisario
#melicloud CloudBuilders
Arias 3751, Piso 7 (C1430CRG)
Ciudad de Buenos Aires - Argentina
Cel: +549(11) 15-3770-1857
Tel : +54(11) 4640-8443


On Wed, Jan 22, 2014 at 12:22 PM, Stefan Hajnoczi  wrote:
> On Tue, Jan 21, 2014 at 04:06:05PM -0200, Alejandro Comisario wrote:
>
> CCed Michael Tsirkin and Jason Wang who work on KVM networking.
>
>> Hi guys, we had in the past, when using physical servers, several
>> throughput issues regarding the throughput of our APIs; in our case we
>> measure this with packets per second, since we don't have that much
>> bandwidth (Mb/s) because our APIs respond with lots of very small packets
>> (maximum response of 3.5k and avg response of 1.5k). When we
>> were using these physical servers, when we reached throughput capacity
>> (due to client timeouts) we touched the ethernet ring configuration
>> and made the problem disappear.
>>
>> Today, with kvm and over 10k virtual instances, when we want to
>> increase the throughput of KVM instances, we bumped into the fact that
>> when using virtio on guests, we have a maximum ring configuration
>> of 256 TX/RX, and from the host side the attached vnet has a txqueuelen
>> of 500.
>>
>> What I want to know is: how can I tune the guest to support more
>> packets per second if I know that's my bottleneck?
>
> I suggest investigating performance in a systematic way.  Set up a
> benchmark that saturates the network.  Post the details of the benchmark
> and the results that you are seeing.
>
> Then, we can discuss how to investigate the root cause of the bottleneck.
>
>> * does virtio expose more packets to configure in the virtual ethernet's
>> ring?
>
> No, ring size is hardcoded in QEMU (on the host).
>
>> * does the use of vhost_net help me with increasing packets per
>> second and not only bandwidth?
>
> vhost_net is generally the most performant network option.
>
>> has anyone had to struggle with this before and knows where I can look into?
>> there's LOTS of information about networking performance
>> tuning of kvm, but nothing related to increasing throughput in pps
>> capacity.
>>
>> This is a couple of configurations that we are having right now on the
>> compute nodes:
>>
>> * 2x1Gb bonded interfaces (if you want to know the more than 20 models we are
>> using, just ask for it)
>> * Multi-queue interfaces, pinned via IRQ to different cores
>> * Linux bridges, no VLAN, no open-vswitch
>> * ubuntu 12.04 kernel 3.2.0-[40-48]
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM and variable-endianness guest CPUs

2014-01-22 Thread Peter Maydell
On 22 January 2014 19:29, Victor Kamensky  wrote:
> On 22 January 2014 09:29, Peter Maydell  wrote:
>> This just isn't how real buses work. There is no
>> "address + 1, address + 2". There is a single address
>> for the memory transaction and a set of data on
>> data lines and some separate size information.
>
> Yes, and those data lines are just binary signal lines
> not numbers. If one would want to describe information
> on data lines as number he/she needs to assign integer
> bits numbers to lines, and that is absolutely arbitrary
> process

It is part of the definition of the bus which signal pin is
D0 and which is D31...

>  In one choose one way to assigned those
> bits to lines and another choose reverse way, they will
> talk about completely different numbers for the same
> signals on the bus. Such data lines enumeration has
> no reflection on how bus actually works. And I don't
> even see why it should be described just as single
> integer, for example one can describe information on
> data lines as set of 4 byte value, nothing wrong with
> such description.

It is not how the hardware works. If you describe it as
a set of 4 bytes, then you need to also say how you are
mapping from those 4 bytes to the actual 32 bit data
transaction the hardware is doing. Which is the question
we're trying to answer in this thread.

I've snipped a huge chunk of my initial reply to this email,
because it all boiled down to "sorry, you're just not correct
about how the hardware works" and it doesn't seem
necessary to repeat it three times. Devices really do see
"this is a transaction with this value and this size". They
do not in any way see a 32 bit word write as "this is a collection
of byte writes". Therefore:

 1) thinking about a 32 bit word write in terms of a byte array
is confusing
 2) since the KVM API is unfortunately stuck with this byte
   array, we must define the semantics of what it actually
   contains, so that the kernel and QEMU can go between
  "the value being read/written in the transaction" and
  "the contents of the byte array

>> As soon as you try to think of the mmio.data as a set
>> of bytes then you have to specify some endianness to
>> the data, so that both sides (kernel and userspace)
>> know how to reconstruct the actual data value from the
>> array of bytes.
>
> What actual value? In what sense? You need to bring
> into discussion semantic of this h/w address to really tell
> that.

I've just spent the last two emails doing exactly that.
The actual value, as in "this CPU just did a memory
transaction of a 32 bit data value".

> BTW could you please propose how will you see such
> "32 bit transaction, value 0x04030201, address $whatever".
> on ARM LE CPU in mmio.data?

That is exactly the problem we're discussing in this thread.
Indeed, I proposed an answer to it, which is that the mmio.data
array should be in host kernel byte order, in which case it
would be (for an LE host kernel) 0x01 in mmio.data[0] and so
on up.

> If it would be {0x01, 0x02, 0x03, 0x4} it is fine with
> me. That is current case ARM LE case when above
> snippets would be executed by guest.
>
> Would we  agree that the same arrangement would be
> true for all other cases on ARM regardless of all other
> endianities of qemu, KVM host, guest, hypervisor, etc?

No; under my proposal, for a big-endian host kernel (and
thus big-endian QEMU) the order would be
mmio.data[0] = 0x04, etc. (This wouldn't change based
on the guest kernel endianness or whether it happened
to have set the E bit temporarily.)

Defining that mmio.data[] is always little-endian would
be a valid definition of an API if we were doing it from
scratch. It has the unfortunate property that it would
completely break the existing PPC BE setups, which
don't define it that way, so it is a non-starter.

Defining it as being always guest-order would mean that
userspace had to continually look at the guest CPU
endianness bit, which is annoying and awkward.

Defining it as always host-endian order is the most
reasonable option available. It also happens to work
for the current QEMU code, which is nice.

thanks
-- PMM


Re: KVM and variable-endianness guest CPUs

2014-01-22 Thread Victor Kamensky
On 22 January 2014 09:29, Peter Maydell  wrote:
> On 22 January 2014 17:19, Victor Kamensky  wrote:
>> On 22 January 2014 02:22, Peter Maydell  wrote:
>>> but the major issue here is that the data being
>>> transferred is not just a bag of bytes. The data[]
>>> array plus the size field are being (mis)used to indicate
>>> that the memory transaction is one of:
>>>  * an 8 bit access
>>>  * a 16 bit access of some uint16_t value
>>>  * a 32 bit access of some uint32_t value
>>>  * a 64 bit access of some uint64_t value
>>>
>>> exactly as a CPU hardware bus would do. It's
>>> because the API is defined in this awkward way with
>>> a uint8_t[] array that we need to specify how both
>>> sides should go from the actual properties of the
>>> memory transaction (value and size) to filling in the
>>> array.
>>
>> While responding to Alex last night, I found, I think,
>> easiest and shortest way to think about mmio.data[]
>>
>> Just for discussion reference here it is again
>> struct {
>> __u64 phys_addr;
>> __u8  data[8];
>> __u32 len;
>> __u8  is_write;
>> } mmio;
>> I believe that in all cases it should be interpreted
>> in the following sense
>>byte data[0] goes into byte at phys_addr + 0
>>byte data[1] goes into byte at phys_addr + 1
>>byte data[2] goes into byte at phys_addr + 2
>>and so on up to len size
>>
>> Basically if it would be on real bus, get byte value
>> that corresponds to phys_addr + 0 address place
>> it into data[0], get byte value that corresponds to
>> phys_addr + 1 address place it into data[1], etc.
>
> This just isn't how real buses work. There is no
> "address + 1, address + 2". There is a single address
> for the memory transaction and a set of data on
> data lines and some separate size information.

Yes, and those data lines are just binary signal lines,
not numbers. If one wants to describe the information
on the data lines as a number, he/she needs to assign
integer bit numbers to the lines, and that assignment is
absolutely arbitrary. If one person assigns those bits
to lines one way and another chooses the reverse way,
they will talk about completely different numbers for the
same signals on the bus. Such data-line enumeration has
no bearing on how the bus actually works. And I don't
even see why it should be described as just a single
integer; for example, one can describe the information on
the data lines as a set of 4 byte values, and nothing is
wrong with such a description.

> How the device at the far end of the bus chooses
> to respond to 32 bit accesses to address X versus
> 8 bit accesses to addresses X through X+3 is entirely
> its own business and unrelated to the CPU. (It would
> be perfectly possible to have a device which when
> you read from address X as 32 bits returned 0x12345678,
> when you read from address X as 16 bits returned
> 0x9abc, returned 0x42 for an 8 bit read from X+1,
> and so on. Having byte reads from X..X+3 return
> values corresponding to parts of the 32 bit access
> is purely a convention.)

I don't follow the above. One may have one 32-bit read
from device address X return 0x12345678,
and another 32-bit read from the same address X
return 0xabcdef12, so what? Maybe a real example
would help.

>> Note nowhere in my above description I've talked
>> about endianity of anything: device, access (E bit),
>> KVM host, guest, hypervisor. All these endianities
>> are irrelevant to mmio interface.
>
> As soon as you try to think of the mmio.data as a set
> of bytes then you have to specify some endianness to
> the data, so that both sides (kernel and userspace)
> know how to reconstruct the actual data value from the
> array of bytes.

What actual value? In what sense? You need to bring the
semantics of this h/w address into the discussion to really
tell that. The driver that reads/writes is aware of the
semantics of those addresses. For example, a device gives
the channel-1 byte value at phys_addr and the channel-2
byte value at phys_addr + 1, so a 16-bit integer read from
phys_addr will bring two channel values into the register,
not one 16-bit integer.

>>> Furthermore, device endianness is entirely irrelevant
>>> for deciding the properties of mmio.data[], because the
>>> thing we're modelling here is essentially the CPU->bus
>>> interface. In real hardware, the properties of individual
>>> devices on the bus are irrelevant to how the CPU's
>>> interface to the bus behaves, and similarly here the
>>> properties of emulated devices don't affect how KVM's
>>> interface to QEMU userspace needs to work.
>>
>> As far as mmio interface concerned I claim that any
>> endianity is irrelevant here. I am utterly lost about
>> endianity of what you care about?
>
> I care about knowing which end of mmio.data is the
> least significant byte, obviously.

LSB of what? Memory semantics has no notion
of an LSB. It comes in only when one starts
interpreting memory content. memcpy does not
hav

Re: KVM and variable-endianness guest CPUs

2014-01-22 Thread Peter Maydell
On 22 January 2014 17:19, Victor Kamensky  wrote:
> On 22 January 2014 02:22, Peter Maydell  wrote:
>> but the major issue here is that the data being
>> transferred is not just a bag of bytes. The data[]
>> array plus the size field are being (mis)used to indicate
>> that the memory transaction is one of:
>>  * an 8 bit access
>>  * a 16 bit access of some uint16_t value
>>  * a 32 bit access of some uint32_t value
>>  * a 64 bit access of some uint64_t value
>>
>> exactly as a CPU hardware bus would do. It's
>> because the API is defined in this awkward way with
>> a uint8_t[] array that we need to specify how both
>> sides should go from the actual properties of the
>> memory transaction (value and size) to filling in the
>> array.
>
> While responding to Alex last night, I found, I think,
> easiest and shortest way to think about mmio.data[]
>
> Just for discussion reference here it is again
> struct {
> __u64 phys_addr;
> __u8  data[8];
> __u32 len;
> __u8  is_write;
> } mmio;
> I believe that in all cases it should be interpreted
> in the following sense
>byte data[0] goes into byte at phys_addr + 0
>byte data[1] goes into byte at phys_addr + 1
>byte data[2] goes into byte at phys_addr + 2
>and so on up to len size
>
> Basically if it would be on real bus, get byte value
> that corresponds to phys_addr + 0 address place
> it into data[0], get byte value that corresponds to
> phys_addr + 1 address place it into data[1], etc.

This just isn't how real buses work. There is no
"address + 1, address + 2". There is a single address
for the memory transaction and a set of data on
data lines and some separate size information.
How the device at the far end of the bus chooses
to respond to 32 bit accesses to address X versus
8 bit accesses to addresses X through X+3 is entirely
its own business and unrelated to the CPU. (It would
be perfectly possible to have a device which when
you read from address X as 32 bits returned 0x12345678,
when you read from address X as 16 bits returned
0x9abc, returned 0x42 for an 8 bit read from X+1,
and so on. Having byte reads from X..X+3 return
values corresponding to parts of the 32 bit access
is purely a convention.)

> Note nowhere in my above description I've talked
> about endianity of anything: device, access (E bit),
> KVM host, guest, hypervisor. All these endianities
> are irrelevant to mmio interface.

As soon as you try to think of the mmio.data as a set
of bytes then you have to specify some endianness to
the data, so that both sides (kernel and userspace)
know how to reconstruct the actual data value from the
array of bytes.

>> Furthermore, device endianness is entirely irrelevant
>> for deciding the properties of mmio.data[], because the
>> thing we're modelling here is essentially the CPU->bus
>> interface. In real hardware, the properties of individual
>> devices on the bus are irrelevant to how the CPU's
>> interface to the bus behaves, and similarly here the
>> properties of emulated devices don't affect how KVM's
>> interface to QEMU userspace needs to work.
>
> As far as mmio interface concerned I claim that any
> endianity is irrelevant here. I am utterly lost about
> endianity of what you care about?

I care about knowing which end of mmio.data is the
least significant byte, obviously.

> Consider
> the following ARM code snippets:
>
> setend le
> mov r1, #0x04030201
> str r1, [r0]
>
> and
>
> setend be
> mov r1, #0x01020304
> str r1, [r0]
>
> when above snippets are executed memory bus
> sees absolutely the same thing, can you tell by
> looking at this memory transaction what endianity
> is it? And endianity of what? I can't.

That is correct. That is because the value sent out on
the bus from the CPU is always the same: it says
"32 bit transaction, value 0x04030201, address $whatever".

> The only thing you can tell by looking at this bus
> memory transaction is that 0x01 byte value goes at
> r0 address, 0x02 byte value goes at r0 + 1 address,
> etc.

No, this part is absolutely wrong, see above.

thanks
-- PMM


Re: [PATCH 00/13] kvm-unit-tests: Various improvements for x86 tests

2014-01-22 Thread Paolo Bonzini

Il 04/01/2014 18:59, Jan Kiszka ha scritto:

Highlights:
 - improved preemption timer and interrupt injection tests
   (obsoletes my two patches in vmx queue)
 - tests for IA32_APIC_BASE writes
 - test for unconditional IO exiting (VMX)
 - basic test of debug facilities (hw breakpoints etc.)

Jan Kiszka (13):
  VMX: Add test cases around interrupt injection and halting
  VMX: Extend preemption timer tests
  apic: Remove redundant enable_apic
  VMX: Fix return label in fault-triggering handlers
  lib/x86: Move exception test code into library
  x2apic: Test for invalid state transitions
  lib/x86/apic: Consolidate over MSR_IA32_APICBASE
  apic: Add test case for relocation and writing reserved bits
  VMX: Check unconditional I/O exiting
  Provide common report and report_summary services
  Ignore *.elf build outputs
  svm: Add missing build dependency
  x86: Add debug facility test case

 .gitignore|   1 +
 Makefile  |   3 +-
 config-x86-common.mak |   4 +-
 config-x86_64.mak |   2 +-
 lib/libcflat.h|   4 +
 lib/report.c  |  36 +++
 lib/x86/apic-defs.h   |   3 +
 lib/x86/apic.c|   7 +-
 lib/x86/desc.c|  24 +
 lib/x86/desc.h|   6 ++
 x86/apic.c|  84 +---
 x86/debug.c   | 113 +
 x86/emulator.c|  16 +--
 x86/eventinj.c|  15 +--
 x86/idt_test.c|  21 +---
 x86/msr.c |  15 +--
 x86/pcid.c|  14 +--
 x86/pmu.c |  37 +++
 x86/taskswitch2.c |  15 +--
 x86/unittests.cfg |   3 +
 x86/vmx.c |  57 +++
 x86/vmx.h |   4 +-
 x86/vmx_tests.c   | 264 ++
 23 files changed, 548 insertions(+), 200 deletions(-)
 create mode 100644 lib/report.c
 create mode 100644 x86/debug.c



I'm applying patches 3-13 to the master branch, and patches 1-2 to the 
vmx branch.


Thanks!

Paolo


Re: KVM and variable-endianness guest CPUs

2014-01-22 Thread Victor Kamensky
Hi Peter,

On 22 January 2014 02:22, Peter Maydell  wrote:
> On 22 January 2014 05:39, Victor Kamensky  wrote:
>> Hi Guys,
>>
>> Christoffer and I had a bit heated chat :) on this
>> subject last night. Christoffer, really appreciate
>> your time! We did not really reach agreement
>> during the chat and Christoffer asked me to follow
>> up on this thread.
>> Here it goes. Sorry, it is very long email.
>>
>> I don't believe we can assign any endianness to the
>> mmio.data[] byte array. I believe mmio.data[] and
>> mmio.len act just like memcpy, and that is all. As
>> memcpy does not imply any endianness of the underlying
>> data, mmio.data[] should not either.
>
> This email is about five times too long to be actually
> useful,

Sorry, about that you may be right.
My below responses much shorter :)

> but the major issue here is that the data being
> transferred is not just a bag of bytes. The data[]
> array plus the size field are being (mis)used to indicate
> that the memory transaction is one of:
>  * an 8 bit access
>  * a 16 bit access of some uint16_t value
>  * a 32 bit access of some uint32_t value
>  * a 64 bit access of some uint64_t value
>
> exactly as a CPU hardware bus would do. It's
> because the API is defined in this awkward way with
> a uint8_t[] array that we need to specify how both
> sides should go from the actual properties of the
> memory transaction (value and size) to filling in the
> array.

While responding to Alex last night, I found what I think
is the easiest and shortest way to think about mmio.data[].

Just for discussion reference here it is again
struct {
__u64 phys_addr;
__u8  data[8];
__u32 len;
__u8  is_write;
} mmio;
I believe that in all cases it should be interpreted
in the following sense
   byte data[0] goes into byte at phys_addr + 0
   byte data[1] goes into byte at phys_addr + 1
   byte data[2] goes into byte at phys_addr + 2
   and so on up to len size

Basically, if it were on a real bus: take the byte value
that corresponds to address phys_addr + 0 and place
it into data[0], take the byte value that corresponds to
phys_addr + 1 and place it into data[1], etc.

I believe this holds for the current ARM LE case and
the PPC BE case. I am asking you to keep it this way
for all other cases. My ARM BE V7 KVM patches
still use it in the same sense.

What is wrong with it?

Note that nowhere in my description above have I talked
about the endianness of anything: device, access (E bit),
KVM host, guest, hypervisor. All these endiannesses
are irrelevant to the mmio interface.

> Furthermore, device endianness is entirely irrelevant
> for deciding the properties of mmio.data[], because the
> thing we're modelling here is essentially the CPU->bus
> interface. In real hardware, the properties of individual
> devices on the bus are irrelevant to how the CPU's
> interface to the bus behaves, and similarly here the
> properties of emulated devices don't affect how KVM's
> interface to QEMU userspace needs to work.

As far as the mmio interface is concerned, I claim that any
endianness is irrelevant here. I am utterly lost about
which endianness you care about. Consider
the following ARM code snippets:

setend le
mov r1, #0x04030201
str r1, [r0]

and

setend be
mov r1, #0x01020304
str r1, [r0]

when the above snippets are executed, the memory bus
sees absolutely the same thing. Can you tell, by
looking at this memory transaction, what its
endianness is? And endianness of what? I can't.
The only thing you can tell by looking at this bus
memory transaction is that byte value 0x01 goes at
address r0, byte value 0x02 goes at address r0 + 1,
etc.

Thanks,
Victor

> MemoryRegion's 'endianness' field, incidentally, is
> a dreadful mess that we should get rid of. It is attempting
> to model the property that some buses/bridges have of
> doing byte-lane-swaps on data that passes through as
> a property of the device itself. It would be better if we
> modelled it properly, with container regions having possible
> byte-swapping and devices just being devices.
>
> thanks
> -- PMM


[PATCH kvm-kmod]

2014-01-22 Thread Paolo Bonzini
After KVM commit 8a3caa6d74597c2a083f7c87f866891a0b12540b, kvm-kmod
is broken in weird ways (for me it breaks every other time kvm is
loaded, but only with ept=0...).

The reason is that, after this commit, empty_zero_page is expected
to be page-aligned, but the kvm-kmod compatibility shim isn't.
empty_zero_page has been exported since v2.6.25:

commit 8232fd625217dc641ed05dd238a8bb5c82828082
Author: Theodore Ts'o 
Date:   Mon Nov 26 20:42:19 2007 +0100

x86: export the symbol empty_zero_page on the 32-bit x86 architecture

The latest KVM driver wants to use the empty_zero_page symbol, and it's
not exported in 32-bit x86 (although it is exported by x86_64, s390, and
uml architectures).

Signed-off-by: "Theodore Ts'o" 
Cc: t...@linutronix.de
Cc: linux-ker...@vger.kernel.com
Cc: kvm-de...@lists.sourceforge.net
Signed-off-by: Thomas Gleixner 
Signed-off-by: Ingo Molnar 

so the compatibility shim should probably just be dropped.

Signed-off-by: Paolo Bonzini 

---


diff --git a/external-module-compat-comm.h b/external-module-compat-comm.h
index 34fb320..580aa9f 100644
--- a/external-module-compat-comm.h
+++ b/external-module-compat-comm.h
@@ -180,18 +180,6 @@ void kvm_smp_send_reschedule(int cpu);
 
 #endif
 
-/* empty_zero_page isn't exported in all kernels */
-#include 
-
-#define empty_zero_page kvm_empty_zero_page
-
-static char empty_zero_page[PAGE_SIZE];
-
-static inline void blahblah(void)
-{
-   (void)empty_zero_page[0];
-}
-
 /* __mmdrop() is not exported before 2.6.25 */
 #include 
 


Re: [PATCH 04/13] VMX: Fix return label in fault-triggering handlers

2014-01-22 Thread Paolo Bonzini

Il 22/01/2014 16:00, Paolo Bonzini ha scritto:

Il 04/01/2014 18:59, Jan Kiszka ha scritto:

From: Jan Kiszka 

Some compiler versions (seen with gcc 4.8.1) move the resume label after
the return statement which, of course, causes severe problems.


Can you include the assembly output?  Do you mean after the "ret"
instruction?


Reproduced now; I think it's a compiler bug... with -O2 it's even weirder:
&&resume points to the *first* instruction in the function.


I'll report it to GCC.


Paolo



Re: [PATCH uq/master] kvm: always update the MPX model specific register

2014-01-22 Thread Paolo Bonzini

Il 22/01/2014 16:29, Marcelo Tosatti ha scritto:

> The original patch from Liu Jinsong restricted them to reset or full
> state updates, but that's unnecessary (and wrong) since the BNDCFGS
> MSR has no side effects.

Why is it necessary to save/restore BNDCFGS MSR on states other
than FULL and RESET?


Yes, nothing in QEMU except reset will touch the MSR, but this applies 
also to all the other MSRs that are saved unconditionally.  It's nice to 
be able to poke them with gdb, and saving/restoring the MSR provides that.


Paolo


Re: Monotonic clock with KVM pv-clock

2014-01-22 Thread Marcelo Tosatti
On Tue, Jan 21, 2014 at 11:23:37PM +0900, Fernando Luis Vazquez Cao wrote:
> (2014/01/21 9:19), Marcelo Tosatti wrote:
> >On Mon, Jan 20, 2014 at 11:59:39PM +0900, Fernando Luis Vazquez Cao wrote:
> >>(2014/01/20 22:33), Marcelo Tosatti wrote:
> >>>On Mon, Jan 20, 2014 at 11:56:56AM +0200, Nadav Har'El wrote:
> If KVM_SYSTEM_TIME is not a correct way to get a monotonic paravirtual 
> clock
> from KVM, is there a correct way?
> >>>Inside a Linux guest? Can use sched_clock().
> >>I would like to mention that Linux guests usually do not use sched_clock()
> >>directly. The reason being that the kvm_clock based sched_clock() is not
> >>marked stable (sched_clock_stable is 0), which means that the pair of
> >>wrappers sched_clock_local()/sched_clock_remote() is used instead.
> >Should verify the requirements of sched_clock_cpu() and enable
> >sched_clock_stable in case it fulfills requirements
> 
> The requirements of cpu_clock() are (from code comments in
> kernel/sched/clock.c):
> 
> 1. cpu_clock(i) provides a fast (execution time) high resolution
> 2. Clock with bounded drift between CPUs.

How high a resolution? And what is the effect of lowering the resolution?

> 3. The value of cpu_clock(i) is monotonic for constant i (when
> comparing cpu_clock(i)
>to cpu_clock(j) for i != j, time can go backwards)

OK.

> 4. The timestamp returned is in nanoseconds.

OK.

> Besides, for sched_clock() to be considered stable
> 
> 5. it has to be synchronized across CPUs.

This contradicts item 3 ?

> >(kvmclock_read can be nondecreasing due to TSC->nanosecond scaling,
> 
> s/nondecreasing/nonincreasing/? 

Yes.

> I guess you mean that consecutive calls to
> kvm_clock_read() can return the same nanosecond value when the TSC
> frequency is greater than 1GHz or there are frequency changes.

Yes, when the CPU can execute more than one kvm_clock_read in one
nanosecond.

> >and not increase for a longer duration with global accumulator, due
> >to cmpxchg)
> 
> Yes. If I understand correctly, that can happen if
> PVCLOCK_TSC_STABLE_BIT is not set.

Yes.

> I think that when PVCLOCK_TSC_STABLE_BIT is set we should be fine as
> long as backward motion is filtered out.

There can be no backward motion with PVCLOCK_TSC_STABLE_BIT (in theory,
and no reports have been seen yet).

Do you have any numbers for the change?



Re: [PATCH uq/master] kvm: always update the MPX model specific register

2014-01-22 Thread Marcelo Tosatti
On Mon, Jan 20, 2014 at 02:25:36PM +0100, Paolo Bonzini wrote:
> The original patch from Liu Jinsong restricted them to reset or full
> state updates, but that's unnecessary (and wrong) since the BNDCFGS
> MSR has no side effects.

Why is it necessary to save/restore BNDCFGS MSR on states other
than FULL and RESET?

> Cc: Liu Jinsong 
> Signed-off-by: Paolo Bonzini 
> ---
>  target-i386/kvm.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 221c8a0..d34981f 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -1161,6 +1161,9 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>  kvm_msr_entry_set(&msrs[n++], MSR_IA32_MISC_ENABLE,
>env->msr_ia32_misc_enable);
>  }
> +if (has_msr_bndcfgs) {
> +kvm_msr_entry_set(&msrs[n++], MSR_IA32_BNDCFGS, env->msr_bndcfgs);
> +}
>  #ifdef TARGET_X86_64
>  if (lm_capable_kernel) {
>  kvm_msr_entry_set(&msrs[n++], MSR_CSTAR, env->cstar);
> @@ -1224,9 +1227,6 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>  if (cpu->hyperv_vapic) {
>  kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
>  }
> -if (has_msr_bndcfgs) {
> -kvm_msr_entry_set(&msrs[n++], MSR_IA32_BNDCFGS, 
> env->msr_bndcfgs);
> -}
>  
>  /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
>   *   kvm_put_msr_feature_control. */
> -- 
> 1.8.3.1
> 


Re: kvm virtio ethernet ring on guest side over high throughput (packet per second)

2014-01-22 Thread Stefan Hajnoczi
On Tue, Jan 21, 2014 at 04:06:05PM -0200, Alejandro Comisario wrote:

CCed Michael Tsirkin and Jason Wang who work on KVM networking.

> Hi guys, we had in the past, when using physical servers, several
> throughput issues regarding the throughput of our APIs; in our case we
> measure this with packets per second, since we don't have that much
> bandwidth (Mb/s) because our APIs respond with lots of very small packets
> (maximum response of 3.5k and avg response of 1.5k). When we
> were using these physical servers, when we reached throughput capacity
> (due to client timeouts) we touched the ethernet ring configuration
> and made the problem disappear.
> 
> Today, with KVM and over 10k virtual instances, when we want to
> increase the throughput of the KVM instances, we run into the fact
> that when using virtio on guests, the maximum ring configuration is
> 256 TX/RX, and on the host side the attached vnet has a txqueuelen
> of 500.
> 
> What I want to know is: how can I tune the guest to support more
> packets per second if I know that's my bottleneck?

I suggest investigating performance in a systematic way.  Set up a
benchmark that saturates the network.  Post the details of the benchmark
and the results that you are seeing.

Then, we can discuss how to investigate the root cause of the bottleneck.
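As a starting point for such a measurement, the per-second packet rate of an interface can be sampled from sysfs counters. This is a minimal sketch, not a saturation benchmark; the interface name `lo` is a placeholder for the guest's virtio NIC (e.g. `eth0`):

```shell
# Hedged sketch: sample the RX packet counter twice, one second apart,
# and report the delta as packets per second.
IFACE=lo   # assumption: replace with the virtio interface under test
rx1=$(cat /sys/class/net/$IFACE/statistics/rx_packets)
sleep 1
rx2=$(cat /sys/class/net/$IFACE/statistics/rx_packets)
echo "$((rx2 - rx1)) packets/sec"
```

Running this on the guest while a load generator (netperf, pktgen, or similar) drives traffic shows whether the observed pps plateaus below the expected line rate.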

> * does virtio expose a larger configurable ring size for the virtual
> ethernet device?

No, ring size is hardcoded in QEMU (on the host).

> * does the use of vhost_net help me increase packets per second, and
> not only bandwidth?

vhost_net is generally the most performant network option.

> has anyone struggled with this before and knows where I can look?
> there's LOTS of information about KVM network performance tuning,
> but nothing about increasing throughput in pps capacity.
> 
> This is a couple of configurations that we are having right now on the
> compute nodes:
> 
> * 2x1Gb bonded interfaces (if you want to know which of the 20+ NIC
> models we use, just ask)
> * Multi-queue interfaces, pinned via IRQ to different cores
> * Linux bridges, no VLAN, no open-vswitch
> * ubuntu 12.04 kernel 3.2.0-[40-48]


Re: [PATCH 04/13] VMX: Fix return label in fault-triggering handlers

2014-01-22 Thread Paolo Bonzini

On 04/01/2014 18:59, Jan Kiszka wrote:

From: Jan Kiszka 

Some compiler versions (seen with gcc 4.8.1) move the resume label after
the return statement which, of course, causes severe problems.


Can you include the assembly output?  Do you mean after the "ret" 
instruction?


The return should not be needed except for the rule that a label must be 
followed by a statement.  But I would not expect any difference between 
a ";" and a barrier in this case.


Paolo


Signed-off-by: Jan Kiszka 
---
 x86/vmx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/x86/vmx.c b/x86/vmx.c
index a475aec..f9d5493 100644
--- a/x86/vmx.c
+++ b/x86/vmx.c
@@ -563,7 +563,7 @@ static void do_vmxon_off(void)
vmx_on();
vmx_off();
 resume:
-   return;
+   barrier();
 }

 static void do_write_feature_control(void)
@@ -572,7 +572,7 @@ static void do_write_feature_control(void)
barrier();
wrmsr(MSR_IA32_FEATURE_CONTROL, 0);
 resume:
-   return;
+   barrier();
 }

 static int test_vmx_feature_control(void)





[PATCH v2 08/10] ARM: KVM: introduce per-vcpu HYP Configuration Register

2014-01-22 Thread Marc Zyngier
So far, KVM/ARM used a fixed HCR configuration per guest, except for
the VI/VF/VA bits, which control virtual interrupts in the absence of a
VGIC.

With the upcoming need to dynamically reconfigure trapping, it becomes
necessary to allow the HCR to be changed on a per-vcpu basis.

The fix here is to mimic what KVM/arm64 already does: a per vcpu HCR
field, initialized at setup time.

Signed-off-by: Marc Zyngier 
---
 arch/arm/include/asm/kvm_arm.h  | 1 -
 arch/arm/include/asm/kvm_host.h | 9 ++---
 arch/arm/kernel/asm-offsets.c   | 1 +
 arch/arm/kvm/guest.c| 1 +
 arch/arm/kvm/interrupts_head.S  | 9 +++--
 5 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index 1d3153c..a843e74 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -69,7 +69,6 @@
 #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
HCR_TWE | HCR_SWIO | HCR_TIDCP)
-#define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
 
 /* System Control Register (SCTLR) bits */
 #define SCTLR_TE   (1 << 30)
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index ba6d33a..918fdc1 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -101,6 +101,12 @@ struct kvm_vcpu_arch {
/* The CPU type we expose to the VM */
u32 midr;
 
+   /* HYP trapping configuration */
+   u32 hcr;
+
+   /* Interrupt related fields */
+   u32 irq_lines;  /* IRQ and FIQ levels */
+
/* Exception Information */
struct kvm_vcpu_fault_info fault;
 
@@ -128,9 +134,6 @@ struct kvm_vcpu_arch {
/* IO related fields */
struct kvm_decode mmio_decode;
 
-   /* Interrupt related fields */
-   u32 irq_lines;  /* IRQ and FIQ levels */
-
/* Cache some mmu pages needed inside spinlock regions */
struct kvm_mmu_memory_cache mmu_page_cache;
 
diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c
index dbe0476..713e807 100644
--- a/arch/arm/kernel/asm-offsets.c
+++ b/arch/arm/kernel/asm-offsets.c
@@ -174,6 +174,7 @@ int main(void)
  DEFINE(VCPU_FIQ_REGS,	offsetof(struct kvm_vcpu, arch.regs.fiq_regs));
  DEFINE(VCPU_PC,		offsetof(struct kvm_vcpu, arch.regs.usr_regs.ARM_pc));
  DEFINE(VCPU_CPSR,		offsetof(struct kvm_vcpu, arch.regs.usr_regs.ARM_cpsr));
+  DEFINE(VCPU_HCR, offsetof(struct kvm_vcpu, arch.hcr));
   DEFINE(VCPU_IRQ_LINES,   offsetof(struct kvm_vcpu, arch.irq_lines));
   DEFINE(VCPU_HSR, offsetof(struct kvm_vcpu, arch.fault.hsr));
   DEFINE(VCPU_HxFAR,   offsetof(struct kvm_vcpu, arch.fault.hxfar));
diff --git a/arch/arm/kvm/guest.c b/arch/arm/kvm/guest.c
index 20f8d97..0c8c044 100644
--- a/arch/arm/kvm/guest.c
+++ b/arch/arm/kvm/guest.c
@@ -38,6 +38,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 {
+   vcpu->arch.hcr = HCR_GUEST_MASK;
return 0;
 }
 
diff --git a/arch/arm/kvm/interrupts_head.S b/arch/arm/kvm/interrupts_head.S
index 4a2a97a..7cb41e1 100644
--- a/arch/arm/kvm/interrupts_head.S
+++ b/arch/arm/kvm/interrupts_head.S
@@ -597,17 +597,14 @@ vcpu	.req	r0	@ vcpu pointer always in r0
 
 /* Enable/Disable: stage-2 trans., trap interrupts, trap wfi, trap smc */
 .macro configure_hyp_role operation
-   mrc p15, 4, r2, c1, c1, 0   @ HCR
-   bic r2, r2, #HCR_VIRT_EXCP_MASK
-   ldr r3, =HCR_GUEST_MASK
.if \operation == vmentry
-   orr r2, r2, r3
+   ldr r2, [vcpu, #VCPU_HCR]
ldr r3, [vcpu, #VCPU_IRQ_LINES]
orr r2, r2, r3
.else
-   bic r2, r2, r3
+   mov r2, #0
.endif
-   mcr p15, 4, r2, c1, c1, 0
+   mcr p15, 4, r2, c1, c1, 0   @ HCR
 .endm
 
 .macro load_vcpu
-- 
1.8.3.4



[PATCH v2 09/10] ARM: KVM: trap VM system registers until MMU and caches are ON

2014-01-22 Thread Marc Zyngier
In order to be able to detect the point where the guest enables
its MMU and caches, trap all the VM related system registers.

Once we see the guest enabling both the MMU and the caches, we
can go back to a saner mode of operation, which is to leave these
registers in complete control of the guest.

Signed-off-by: Marc Zyngier 
---
 arch/arm/include/asm/kvm_arm.h |  3 +-
 arch/arm/kvm/coproc.c  | 85 ++
 arch/arm/kvm/coproc_a15.c  |  2 +-
 arch/arm/kvm/coproc_a7.c   |  2 +-
 4 files changed, 73 insertions(+), 19 deletions(-)

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index a843e74..816db0b 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -55,6 +55,7 @@
  * The bits we set in HCR:
  * TAC:Trap ACTLR
  * TSC:Trap SMC
+ * TVM:Trap VM ops (until MMU and caches are on)
  * TSW:Trap cache operations by set/way
  * TWI:Trap WFI
  * TWE:Trap WFE
@@ -68,7 +69,7 @@
  */
 #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
-   HCR_TWE | HCR_SWIO | HCR_TIDCP)
+   HCR_TVM | HCR_TWE | HCR_SWIO | HCR_TIDCP)
 
 /* System Control Register (SCTLR) bits */
 #define SCTLR_TE   (1 << 30)
diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
index 126c90d..1839770 100644
--- a/arch/arm/kvm/coproc.c
+++ b/arch/arm/kvm/coproc.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -205,6 +206,55 @@ done:
 }
 
 /*
+ * Generic accessor for VM registers. Only called as long as HCR_TVM
+ * is set.
+ */
+static bool access_vm_reg(struct kvm_vcpu *vcpu,
+ const struct coproc_params *p,
+ const struct coproc_reg *r)
+{
+   if (p->is_write) {
+   vcpu->arch.cp15[r->reg] = *vcpu_reg(vcpu, p->Rt1);
+   if (p->is_64bit)
+   vcpu->arch.cp15[r->reg + 1] = *vcpu_reg(vcpu, p->Rt2);
+   } else {
+   *vcpu_reg(vcpu, p->Rt1) = vcpu->arch.cp15[r->reg];
+   if (p->is_64bit)
+   *vcpu_reg(vcpu, p->Rt2) = vcpu->arch.cp15[r->reg + 1];
+   }
+
+   return true;
+}
+
+/*
+ * SCTLR accessor. Only called as long as HCR_TVM is set.  If the
+ * guest enables the MMU, we stop trapping the VM sys_regs and leave
+ * it in complete control of the caches.
+ *
+ * Used by the cpu-specific code.
+ */
+bool access_sctlr(struct kvm_vcpu *vcpu,
+ const struct coproc_params *p,
+ const struct coproc_reg *r)
+{
+   if (p->is_write) {
+   unsigned long val;
+
+   val = *vcpu_reg(vcpu, p->Rt1);
+   vcpu->arch.cp15[r->reg] = val;
+
+   if ((val & (0b101)) == 0b101) { /* MMU+Caches enabled? */
+   vcpu->arch.hcr &= ~HCR_TVM;
+   stage2_flush_vm(vcpu->kvm);
+   }
+   } else {
+   *vcpu_reg(vcpu, p->Rt1) = vcpu->arch.cp15[r->reg];
+   }
+
+   return true;
+}
+
+/*
  * We could trap ID_DFR0 and tell the guest we don't support performance
  * monitoring.  Unfortunately the patch to make the kernel check ID_DFR0 was
  * NAKed, so it will read the PMCR anyway.
@@ -261,33 +311,36 @@ static const struct coproc_reg cp15_regs[] = {
{ CRn( 1), CRm( 0), Op1( 0), Op2( 2), is32,
NULL, reset_val, c1_CPACR, 0x },
 
-   /* TTBR0/TTBR1: swapped by interrupt.S. */
-   { CRm64( 2), Op1( 0), is64, NULL, reset_unknown64, c2_TTBR0 },
-   { CRm64( 2), Op1( 1), is64, NULL, reset_unknown64, c2_TTBR1 },
-
-   /* TTBCR: swapped by interrupt.S. */
+   /* TTBR0/TTBR1/TTBCR: swapped by interrupt.S. */
+   { CRm64( 2), Op1( 0), is64, access_vm_reg, reset_unknown64, c2_TTBR0 },
+   { CRn(2), CRm( 0), Op1( 0), Op2( 0), is32,
+   access_vm_reg, reset_unknown, c2_TTBR0 },
+   { CRn(2), CRm( 0), Op1( 0), Op2( 1), is32,
+   access_vm_reg, reset_unknown, c2_TTBR1 },
{ CRn( 2), CRm( 0), Op1( 0), Op2( 2), is32,
-   NULL, reset_val, c2_TTBCR, 0x },
+   access_vm_reg, reset_val, c2_TTBCR, 0x },
+   { CRm64( 2), Op1( 1), is64, access_vm_reg, reset_unknown64, c2_TTBR1 },
+
 
/* DACR: swapped by interrupt.S. */
{ CRn( 3), CRm( 0), Op1( 0), Op2( 0), is32,
-   NULL, reset_unknown, c3_DACR },
+   access_vm_reg, reset_unknown, c3_DACR },
 
/* DFSR/IFSR/ADFSR/AIFSR: swapped by interrupt.S. */
{ CRn( 5), CRm( 0), Op1( 0), Op2( 0), is32,
-   NULL, reset_unknown, c5_DFSR },
+   access_vm_reg, reset_unknown, c5_DFSR },
{ 

[PATCH v2 10/10] ARM: KVM: add world-switch for AMAIR{0,1}

2014-01-22 Thread Marc Zyngier
HCR.TVM traps (among other things) accesses to AMAIR0 and AMAIR1.
In order to minimise the amount of surprise a guest could generate by
trying to access these registers with caches off, add them to the
list of registers we switch/handle.

Signed-off-by: Marc Zyngier 
---
 arch/arm/include/asm/kvm_asm.h |  4 +++-
 arch/arm/kvm/coproc.c  |  6 ++
 arch/arm/kvm/interrupts_head.S | 12 ++--
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
index 661da11..53b3c4a 100644
--- a/arch/arm/include/asm/kvm_asm.h
+++ b/arch/arm/include/asm/kvm_asm.h
@@ -48,7 +48,9 @@
 #define c13_TID_URO26  /* Thread ID, User R/O */
 #define c13_TID_PRIV   27  /* Thread ID, Privileged */
 #define c14_CNTKCTL28  /* Timer Control Register (PL1) */
-#define NR_CP15_REGS   29  /* Number of regs (incl. invalid) */
+#define c10_AMAIR0 29  /* Auxilary Memory Attribute Indirection Reg0 */
+#define c10_AMAIR1 30  /* Auxilary Memory Attribute Indirection Reg1 */
+#define NR_CP15_REGS   31  /* Number of regs (incl. invalid) */
 
 #define ARM_EXCEPTION_RESET  0
 #define ARM_EXCEPTION_UNDEFINED   1
diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
index 1839770..539f6d4 100644
--- a/arch/arm/kvm/coproc.c
+++ b/arch/arm/kvm/coproc.c
@@ -381,6 +381,12 @@ static const struct coproc_reg cp15_regs[] = {
{ CRn(10), CRm( 2), Op1( 0), Op2( 1), is32,
access_vm_reg, reset_unknown, c10_NMRR},
 
+   /* AMAIR0/AMAIR1: swapped by interrupt.S. */
+   { CRn(10), CRm( 3), Op1( 0), Op2( 0), is32,
+   access_vm_reg, reset_unknown, c10_AMAIR0},
+   { CRn(10), CRm( 3), Op1( 0), Op2( 1), is32,
+   access_vm_reg, reset_unknown, c10_AMAIR1},
+
/* VBAR: swapped by interrupt.S. */
{ CRn(12), CRm( 0), Op1( 0), Op2( 0), is32,
NULL, reset_val, c12_VBAR, 0x },
diff --git a/arch/arm/kvm/interrupts_head.S b/arch/arm/kvm/interrupts_head.S
index 7cb41e1..e4eaf30 100644
--- a/arch/arm/kvm/interrupts_head.S
+++ b/arch/arm/kvm/interrupts_head.S
@@ -303,13 +303,17 @@ vcpu	.req	r0	@ vcpu pointer always in r0
 
mrc p15, 0, r2, c14, c1, 0  @ CNTKCTL
mrrcp15, 0, r4, r5, c7  @ PAR
+   mrc p15, 0, r6, c10, c3, 0  @ AMAIR0
+   mrc p15, 0, r7, c10, c3, 1  @ AMAIR1
 
.if \store_to_vcpu == 0
-   push{r2,r4-r5}
+   push{r2,r4-r7}
.else
str r2, [vcpu, #CP15_OFFSET(c14_CNTKCTL)]
add r12, vcpu, #CP15_OFFSET(c7_PAR)
strdr4, r5, [r12]
+   str r6, [vcpu, #CP15_OFFSET(c10_AMAIR0)]
+   str r7, [vcpu, #CP15_OFFSET(c10_AMAIR1)]
.endif
 .endm
 
@@ -322,15 +326,19 @@ vcpu	.req	r0	@ vcpu pointer always in r0
  */
 .macro write_cp15_state read_from_vcpu
.if \read_from_vcpu == 0
-   pop {r2,r4-r5}
+   pop {r2,r4-r7}
.else
ldr r2, [vcpu, #CP15_OFFSET(c14_CNTKCTL)]
add r12, vcpu, #CP15_OFFSET(c7_PAR)
ldrdr4, r5, [r12]
+   ldr r6, [vcpu, #CP15_OFFSET(c10_AMAIR0)]
+   ldr r7, [vcpu, #CP15_OFFSET(c10_AMAIR1)]
.endif
 
mcr p15, 0, r2, c14, c1, 0  @ CNTKCTL
mcrrp15, 0, r4, r5, c7  @ PAR
+   mcr p15, 0, r6, c10, c3, 0  @ AMAIR0
+   mcr p15, 0, r7, c10, c3, 1  @ AMAIR1
 
.if \read_from_vcpu == 0
pop {r2-r12}
-- 
1.8.3.4



[PATCH v2 01/10] arm64: KVM: force cache clean on page fault when caches are off

2014-01-22 Thread Marc Zyngier
In order for the guest with caches off to observe data written
contained in a given page, we need to make sure that page is
committed to memory, and not just hanging in the cache (as
guest accesses are completely bypassing the cache until it
decides to enable it).

For this purpose, hook into the coherent_icache_guest_page
function and flush the region if the guest SCTLR_EL1
register doesn't show the MMU and caches as being enabled.
The function also gets renamed to coherent_cache_guest_page.

Signed-off-by: Marc Zyngier 
Reviewed-by: Catalin Marinas 
---
 arch/arm/include/asm/kvm_mmu.h   |  4 ++--
 arch/arm/kvm/mmu.c   |  4 ++--
 arch/arm64/include/asm/kvm_mmu.h | 11 +++
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 77de4a4..f997b9e 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -116,8 +116,8 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
 
 struct kvm;
 
-static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
- unsigned long size)
+static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
+unsigned long size)
 {
/*
 * If we are going to insert an instruction page and the icache is
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 5809069..415fd63 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -713,7 +713,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
kvm_set_s2pmd_writable(&new_pmd);
kvm_set_pfn_dirty(pfn);
}
-   coherent_icache_guest_page(kvm, hva & PMD_MASK, PMD_SIZE);
+   coherent_cache_guest_page(vcpu, hva & PMD_MASK, PMD_SIZE);
ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
} else {
pte_t new_pte = pfn_pte(pfn, PAGE_S2);
@@ -721,7 +721,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
kvm_set_s2pte_writable(&new_pte);
kvm_set_pfn_dirty(pfn);
}
-   coherent_icache_guest_page(kvm, hva, PAGE_SIZE);
+   coherent_cache_guest_page(vcpu, hva, PAGE_SIZE);
ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
}
 
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 680f74e..2232dd0 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -106,7 +106,6 @@ static inline bool kvm_is_write_fault(unsigned long esr)
return true;
 }
 
-static inline void kvm_clean_dcache_area(void *addr, size_t size) {}
 static inline void kvm_clean_pgd(pgd_t *pgd) {}
 static inline void kvm_clean_pmd_entry(pmd_t *pmd) {}
 static inline void kvm_clean_pte(pte_t *pte) {}
@@ -124,9 +123,14 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
 
 struct kvm;
 
-static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
- unsigned long size)
+#define kvm_flush_dcache_to_poc(a,l)   __flush_dcache_area((a), (l))
+
+static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
+unsigned long size)
 {
+   if ((vcpu_sys_reg(vcpu, SCTLR_EL1) & 0b101) != 0b101)
+   kvm_flush_dcache_to_poc((void *)hva, size);
+
if (!icache_is_aliasing()) {/* PIPT */
flush_icache_range(hva, hva + size);
} else if (!icache_is_aivivt()) {   /* non ASID-tagged VIVT */
@@ -135,7 +139,6 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, hva_t hva,
}
 }
 
-#define kvm_flush_dcache_to_poc(a,l)   __flush_dcache_area((a), (l))
 
 #endif /* __ASSEMBLY__ */
 #endif /* __ARM64_KVM_MMU_H__ */
-- 
1.8.3.4



[PATCH v2 05/10] ARM: KVM: force cache clean on page fault when caches are off

2014-01-22 Thread Marc Zyngier
In order for the guest with caches off to observe data written
contained in a given page, we need to make sure that page is
committed to memory, and not just hanging in the cache (as
guest accesses are completely bypassing the cache until it
decides to enable it).

For this purpose, hook into the coherent_cache_guest_page
function and flush the region if the guest SCTLR
register doesn't show the MMU and caches as being enabled.

Signed-off-by: Marc Zyngier 
---
 arch/arm/include/asm/kvm_mmu.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index cbab9ba..fa023e2 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -116,9 +116,14 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
 
 struct kvm;
 
+#define kvm_flush_dcache_to_poc(a,l)   __cpuc_flush_dcache_area((a), (l))
+
 static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
 unsigned long size)
 {
+   if ((vcpu->arch.cp15[c1_SCTLR] & 0b101) != 0b101)
+   kvm_flush_dcache_to_poc((void *)hva, size);
+   
/*
 * If we are going to insert an instruction page and the icache is
 * either VIPT or PIPT, there is a potential problem where the host
@@ -139,8 +144,6 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
}
 }
 
-#define kvm_flush_dcache_to_poc(a,l)   __cpuc_flush_dcache_area((a), (l))
-
 void stage2_flush_vm(struct kvm *kvm);
 
 #endif /* !__ASSEMBLY__ */
-- 
1.8.3.4



[PATCH v2 07/10] ARM: KVM: fix ordering of 64bit coprocessor accesses

2014-01-22 Thread Marc Zyngier
Commit 240e99cbd00a (ARM: KVM: Fix 64-bit coprocessor handling)
added an ordering dependency for the 64bit registers.

The order described is: CRn, CRm, Op1, Op2, 64bit-first.

Unfortunately, the implementation is: CRn, 64bit-first, CRm...

Move the 64bit test to be last in order to match the documentation.

Signed-off-by: Marc Zyngier 
---
 arch/arm/kvm/coproc.h | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/arm/kvm/coproc.h b/arch/arm/kvm/coproc.h
index c5ad7ff..1a44bbe 100644
--- a/arch/arm/kvm/coproc.h
+++ b/arch/arm/kvm/coproc.h
@@ -135,13 +135,13 @@ static inline int cmp_reg(const struct coproc_reg *i1,
return -1;
if (i1->CRn != i2->CRn)
return i1->CRn - i2->CRn;
-   if (i1->is_64 != i2->is_64)
-   return i2->is_64 - i1->is_64;
if (i1->CRm != i2->CRm)
return i1->CRm - i2->CRm;
if (i1->Op1 != i2->Op1)
return i1->Op1 - i2->Op1;
-   return i1->Op2 - i2->Op2;
+   if (i1->Op2 != i2->Op2)
+   return i1->Op2 - i2->Op2;
+   return i2->is_64 - i1->is_64;
 }
 
 
@@ -153,4 +153,8 @@ static inline int cmp_reg(const struct coproc_reg *i1,
 #define is64   .is_64 = true
 #define is32   .is_64 = false
 
+bool access_sctlr(struct kvm_vcpu *vcpu,
+ const struct coproc_params *p,
+ const struct coproc_reg *r);
+
 #endif /* __ARM_KVM_COPROC_LOCAL_H__ */
-- 
1.8.3.4



[PATCH v2 03/10] arm64: KVM: trap VM system registers until MMU and caches are ON

2014-01-22 Thread Marc Zyngier
In order to be able to detect the point where the guest enables
its MMU and caches, trap all the VM related system registers.

Once we see the guest enabling both the MMU and the caches, we
can go back to a saner mode of operation, which is to leave these
registers in complete control of the guest.

Signed-off-by: Marc Zyngier 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/include/asm/kvm_arm.h |  3 +-
 arch/arm64/include/asm/kvm_asm.h |  3 +-
 arch/arm64/kvm/sys_regs.c| 99 +++-
 3 files changed, 91 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index c98ef47..fd0a651 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -62,6 +62,7 @@
  * RW: 64bit by default, can be overriden for 32bit VMs
  * TAC:Trap ACTLR
  * TSC:Trap SMC
+ * TVM:Trap VM ops (until M+C set in SCTLR_EL1)
  * TSW:Trap cache operations by set/way
  * TWE:Trap WFE
  * TWI:Trap WFI
@@ -74,7 +75,7 @@
  * SWIO:   Turn set/way invalidates into set/way clean+invalidate
  */
 #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
-HCR_BSU_IS | HCR_FB | HCR_TAC | \
+HCR_TVM | HCR_BSU_IS | HCR_FB | HCR_TAC | \
 HCR_AMO | HCR_IMO | HCR_FMO | \
 HCR_SWIO | HCR_TIDCP | HCR_RW)
 #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 3d796b4..89d7796 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -81,7 +81,8 @@
 #define c13_TID_URW(TPIDR_EL0 * 2) /* Thread ID, User R/W */
 #define c13_TID_URO(TPIDRRO_EL0 * 2)/* Thread ID, User R/O */
 #define c13_TID_PRIV   (TPIDR_EL1 * 2) /* Thread ID, Privileged */
-#define c10_AMAIR  (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */
+#define c10_AMAIR0 (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */
+#define c10_AMAIR1 (c10_AMAIR0 + 1)/* Aux Memory Attr Indirection Reg */
 #define c14_CNTKCTL(CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */
 #define NR_CP15_REGS   (NR_SYS_REGS * 2)
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index f063750..1a4731b 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -121,6 +121,55 @@ done:
 }
 
 /*
+ * Generic accessor for VM registers. Only called as long as HCR_TVM
+ * is set.
+ */
+static bool access_vm_reg(struct kvm_vcpu *vcpu,
+ const struct sys_reg_params *p,
+ const struct sys_reg_desc *r)
+{
+   unsigned long val;
+
+   BUG_ON(!p->is_write);
+
+   val = *vcpu_reg(vcpu, p->Rt);
+   if (!p->is_aarch32) {
+   vcpu_sys_reg(vcpu, r->reg) = val;
+   } else {
+   vcpu_cp15(vcpu, r->reg) = val & 0xUL;
+   if (!p->is_32bit)
+   vcpu_cp15(vcpu, r->reg + 1) = val >> 32;
+   }
+   return true;
+}
+
+/*
+ * SCTLR_EL1 accessor. Only called as long as HCR_TVM is set.  If the
+ * guest enables the MMU, we stop trapping the VM sys_regs and leave
+ * it in complete control of the caches.
+ */
+static bool access_sctlr(struct kvm_vcpu *vcpu,
+const struct sys_reg_params *p,
+const struct sys_reg_desc *r)
+{
+   unsigned long val;
+
+   BUG_ON(!p->is_write);
+
+   val = *vcpu_reg(vcpu, p->Rt);
+
+   if (!p->is_aarch32)
+   vcpu_sys_reg(vcpu, r->reg) = val;
+   else
+   vcpu_cp15(vcpu, r->reg) = val & 0xUL;
+
+   if ((val & (0b101)) == 0b101)   /* MMU+Caches enabled? */
+   vcpu->arch.hcr_el2 &= ~HCR_TVM;
+
+   return true;
+}
+
+/*
  * We could trap ID_DFR0 and tell the guest we don't support performance
  * monitoring.  Unfortunately the patch to make the kernel check ID_DFR0 was
  * NAKed, so it will read the PMCR anyway.
@@ -185,32 +234,32 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  NULL, reset_mpidr, MPIDR_EL1 },
/* SCTLR_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b0001), CRm(0b), Op2(0b000),
- NULL, reset_val, SCTLR_EL1, 0x00C50078 },
+ access_sctlr, reset_val, SCTLR_EL1, 0x00C50078 },
/* CPACR_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b0001), CRm(0b), Op2(0b010),
  NULL, reset_val, CPACR_EL1, 0 },
/* TTBR0_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b), Op2(0b000),
- NULL, reset_unknown, TTBR0_EL1 },
+ access_vm_reg, reset_unknown, TTBR0_EL1 },
/* TTBR1_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b), Op2(0b001),
- NULL, reset_unknown, TTBR1_EL1 },
+ access_vm_reg, reset_unknown, TTBR1_EL1 },
/* TCR_E

[PATCH v2 06/10] ARM: KVM: fix handling of trapped 64bit coprocessor accesses

2014-01-22 Thread Marc Zyngier
Commit 240e99cbd00a (ARM: KVM: Fix 64-bit coprocessor handling)
changed the way we match the 64bit coprocessor access from
user space, but didn't update the trap handler for the same
set of registers.

The effect is that a trapped 64bit access is never matched, leading
to a fault being injected into the guest. This went unnoticed as we
didn't really trap any 64bit registers so far.

Placing the CRm field of the access into the CRn field of the matching
structure fixes the problem. Also update the debug feature to emit the
expected string in case of failing match.

Signed-off-by: Marc Zyngier 
---
 arch/arm/kvm/coproc.c | 4 ++--
 arch/arm/kvm/coproc.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
index 78c0885..126c90d 100644
--- a/arch/arm/kvm/coproc.c
+++ b/arch/arm/kvm/coproc.c
@@ -443,7 +443,7 @@ int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
struct coproc_params params;
 
-   params.CRm = (kvm_vcpu_get_hsr(vcpu) >> 1) & 0xf;
+   params.CRn = (kvm_vcpu_get_hsr(vcpu) >> 1) & 0xf;
params.Rt1 = (kvm_vcpu_get_hsr(vcpu) >> 5) & 0xf;
params.is_write = ((kvm_vcpu_get_hsr(vcpu) & 1) == 0);
params.is_64bit = true;
@@ -451,7 +451,7 @@ int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run)
params.Op1 = (kvm_vcpu_get_hsr(vcpu) >> 16) & 0xf;
params.Op2 = 0;
params.Rt2 = (kvm_vcpu_get_hsr(vcpu) >> 10) & 0xf;
-   params.CRn = 0;
+   params.CRm = 0;
 
return emulate_cp15(vcpu, ¶ms);
 }
diff --git a/arch/arm/kvm/coproc.h b/arch/arm/kvm/coproc.h
index 0461d5c..c5ad7ff 100644
--- a/arch/arm/kvm/coproc.h
+++ b/arch/arm/kvm/coproc.h
@@ -58,8 +58,8 @@ static inline void print_cp_instr(const struct coproc_params *p)
 {
/* Look, we even formatted it for you to paste into the table! */
if (p->is_64bit) {
-   kvm_pr_unimpl(" { CRm(%2lu), Op1(%2lu), is64, func_%s },\n",
- p->CRm, p->Op1, p->is_write ? "write" : "read");
+   kvm_pr_unimpl(" { CRm64(%2lu), Op1(%2lu), is64, func_%s },\n",
+ p->CRn, p->Op1, p->is_write ? "write" : "read");
} else {
		kvm_pr_unimpl(" { CRn(%2lu), CRm(%2lu), Op1(%2lu), Op2(%2lu), is32,"
  " func_%s },\n",
-- 
1.8.3.4



[PATCH v2 02/10] arm64: KVM: allows discrimination of AArch32 sysreg access

2014-01-22 Thread Marc Zyngier
The current handling of AArch32 trapping is slightly less than
perfect, as it is not possible (from a handler point of view)
to distinguish it from an AArch64 access, nor to tell a 32bit
from a 64bit access either.

Fix this by introducing two additional flags:
- is_aarch32: true if the access was made in AArch32 mode
- is_32bit: true if is_aarch32 == true and a MCR/MRC instruction
  was used to perform the access (as opposed to MCRR/MRRC).

This allows a handler to cover all the possible conditions in which
a system register gets trapped.

Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/sys_regs.c | 5 +
 arch/arm64/kvm/sys_regs.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 02e9d09..f063750 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -437,6 +437,8 @@ int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run)
u32 hsr = kvm_vcpu_get_hsr(vcpu);
int Rt2 = (hsr >> 10) & 0xf;
 
+   params.is_aarch32 = true;
+   params.is_32bit = false;
params.CRm = (hsr >> 1) & 0xf;
params.Rt = (hsr >> 5) & 0xf;
params.is_write = ((hsr & 1) == 0);
@@ -480,6 +482,8 @@ int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run)
struct sys_reg_params params;
u32 hsr = kvm_vcpu_get_hsr(vcpu);
 
+   params.is_aarch32 = true;
+   params.is_32bit = true;
params.CRm = (hsr >> 1) & 0xf;
params.Rt  = (hsr >> 5) & 0xf;
params.is_write = ((hsr & 1) == 0);
@@ -549,6 +553,7 @@ int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run)
struct sys_reg_params params;
unsigned long esr = kvm_vcpu_get_hsr(vcpu);
 
+   params.is_aarch32 = false;
params.Op0 = (esr >> 20) & 3;
params.Op1 = (esr >> 14) & 0x7;
params.CRn = (esr >> 10) & 0xf;
diff --git a/arch/arm64/kvm/sys_regs.h b/arch/arm64/kvm/sys_regs.h
index d50d372..d411e25 100644
--- a/arch/arm64/kvm/sys_regs.h
+++ b/arch/arm64/kvm/sys_regs.h
@@ -30,6 +30,8 @@ struct sys_reg_params {
u8  Op2;
u8  Rt;
boolis_write;
+   boolis_aarch32;
+   boolis_32bit;   /* Only valid if is_aarch32 is true */
 };
 
 struct sys_reg_desc {
-- 
1.8.3.4



[PATCH v2 04/10] arm64: KVM: flush VM pages before letting the guest enable caches

2014-01-22 Thread Marc Zyngier
When the guest runs with caches disabled (like in an early boot
sequence, for example), all the writes are directly going to RAM,
bypassing the caches altogether.

Once the MMU and caches are enabled, whatever sits in the cache
suddenly becomes visible, which isn't what the guest expects.

A way to avoid this potential disaster is to invalidate the cache
when the MMU is being turned on. For this, we hook into the SCTLR_EL1
trapping code, and scan the stage-2 page tables, invalidating the
pages/sections that have already been mapped in.

Signed-off-by: Marc Zyngier 
Reviewed-by: Catalin Marinas 
---
 arch/arm/include/asm/kvm_mmu.h   |  2 +
 arch/arm/kvm/mmu.c   | 83 
 arch/arm64/include/asm/kvm_mmu.h |  1 +
 arch/arm64/kvm/sys_regs.c|  5 ++-
 4 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index f997b9e..cbab9ba 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -141,6 +141,8 @@ static inline void coherent_cache_guest_page(struct 
kvm_vcpu *vcpu, hva_t hva,
 
 #define kvm_flush_dcache_to_poc(a,l)   __cpuc_flush_dcache_area((a), (l))
 
+void stage2_flush_vm(struct kvm *kvm);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ARM_KVM_MMU_H__ */
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 415fd63..52f8b7d 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -187,6 +187,89 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
}
 }
 
+void stage2_flush_ptes(struct kvm *kvm, pmd_t *pmd,
+  unsigned long addr, unsigned long end)
+{
+   pte_t *pte;
+
+   pte = pte_offset_kernel(pmd, addr);
+   do {
+   if (!pte_none(*pte)) {
+   hva_t hva = gfn_to_hva(kvm, addr >> PAGE_SHIFT);
+   kvm_flush_dcache_to_poc((void *)hva, PAGE_SIZE);
+   }
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+}
+
+void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
+  unsigned long addr, unsigned long end)
+{
+   pmd_t *pmd;
+   unsigned long next;
+
+   pmd = pmd_offset(pud, addr);
+   do {
+   next = pmd_addr_end(addr, end);
+   if (!pmd_none(*pmd)) {
+   if (kvm_pmd_huge(*pmd)) {
+   hva_t hva = gfn_to_hva(kvm, addr >> PAGE_SHIFT);
+   kvm_flush_dcache_to_poc((void *)hva, PMD_SIZE);
+   } else {
+   stage2_flush_ptes(kvm, pmd, addr, next);
+   }
+   }
+   } while (pmd++, addr = next, addr != end);
+}
+
+void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
+  unsigned long addr, unsigned long end)
+{
+   pud_t *pud;
+   unsigned long next;
+
+   pud = pud_offset(pgd, addr);
+   do {
+   next = pud_addr_end(addr, end);
+   if (!pud_none(*pud)) {
+   if (pud_huge(*pud)) {
+   hva_t hva = gfn_to_hva(kvm, addr >> PAGE_SHIFT);
+   kvm_flush_dcache_to_poc((void *)hva, PUD_SIZE);
+   } else {
+   stage2_flush_pmds(kvm, pud, addr, next);
+   }
+   }
+   } while (pud++, addr = next, addr != end);
+}
+
+static void stage2_flush_memslot(struct kvm *kvm,
+struct kvm_memory_slot *memslot)
+{
+   unsigned long addr = memslot->base_gfn << PAGE_SHIFT;
+   unsigned long end = addr + PAGE_SIZE * memslot->npages;
+   unsigned long next;
+   pgd_t *pgd;
+
+   pgd = kvm->arch.pgd + pgd_index(addr);
+   do {
+   next = pgd_addr_end(addr, end);
+   stage2_flush_puds(kvm, pgd, addr, next);
+   } while (pgd++, addr = next, addr != end);
+}
+
+void stage2_flush_vm(struct kvm *kvm)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *memslot;
+
+   spin_lock(&kvm->mmu_lock);
+
+   slots = kvm_memslots(kvm);
+   kvm_for_each_memslot(memslot, slots)
+   stage2_flush_memslot(kvm, memslot);
+
+   spin_unlock(&kvm->mmu_lock);
+}
+
 /**
  * free_boot_hyp_pgd - free HYP boot page tables
  *
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 2232dd0..b7b2ca3 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -139,6 +139,7 @@ static inline void coherent_cache_guest_page(struct 
kvm_vcpu *vcpu, hva_t hva,
}
 }
 
+void stage2_flush_vm(struct kvm *kvm);
 
 #endif /* __ASSEMBLY__ */
 #endif /* __ARM64_KVM_MMU_H__ */
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 1a4731b..3d27bb8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #in

[PATCH v2 00/10] arm/arm64: KVM: host cache maintenance when guest caches are off

2014-01-22 Thread Marc Zyngier
When we run a guest with cache disabled, we don't flush the cache to
the Point of Coherency, hence possibly missing bits of data that have
been written in the cache, but have not yet reached memory.

We also have the opposite issue: when a guest enables its cache,
whatever sits in the cache is suddenly going to become visible,
shadowing whatever the guest has written into RAM.

There are several approaches to these issues:
- Using the DC bit when caches are off: this breaks guests assuming
  caches off while doing DMA operations. Bootloaders, for example.
  It also breaks the I-D coherency.
- Fetch the memory attributes on translation fault, and flush the
  cache while handling the fault. This relies on using the PAR_EL1
  register to obtain the Stage-1 memory attributes, and tends to be
  slow.
- Detecting the translation faults occurring with MMU off (and
  performing a cache clean), and trapping SCTLR_EL1 to detect the
  moment when the guest is turning its caches on (and performing a
  cache invalidation). Trapping of SCTLR_EL1 is then disabled to
  ensure the best performance.

This patch series implements the last solution, for both arm and
arm64. Tested on TC2 (ARMv7) and FVP model (ARMv8).

From v1 (http://www.spinics.net/lists/kvm/msg99404.html):
- Fixed AArch32 VM handling on arm64 (Reported by Anup)
- Added ARMv7 support:
  * Fixed a couple of issues regarding handling of 64bit cp15 regs
  * Per-vcpu HCR
  * Switching of AMAIR0 and AMAIR1

Marc Zyngier (10):
  arm64: KVM: force cache clean on page fault when caches are off
  arm64: KVM: allows discrimination of AArch32 sysreg access
  arm64: KVM: trap VM system registers until MMU and caches are ON
  arm64: KVM: flush VM pages before letting the guest enable caches
  ARM: KVM: force cache clean on page fault when caches are off
  ARM: KVM: fix handling of trapped 64bit coprocessor accesses
  ARM: KVM: fix ordering of 64bit coprocessor accesses
  ARM: KVM: introduce per-vcpu HYP Configuration Register
  ARM: KVM: trap VM system registers until MMU and caches are ON
  ARM: KVM: add world-switch for AMAIR{0,1}

 arch/arm/include/asm/kvm_arm.h   |   4 +-
 arch/arm/include/asm/kvm_asm.h   |   4 +-
 arch/arm/include/asm/kvm_host.h  |   9 ++--
 arch/arm/include/asm/kvm_mmu.h   |  11 ++--
 arch/arm/kernel/asm-offsets.c|   1 +
 arch/arm/kvm/coproc.c|  95 +++---
 arch/arm/kvm/coproc.h|  14 +++--
 arch/arm/kvm/coproc_a15.c|   2 +-
 arch/arm/kvm/coproc_a7.c |   2 +-
 arch/arm/kvm/guest.c |   1 +
 arch/arm/kvm/interrupts_head.S   |  21 +---
 arch/arm/kvm/mmu.c   |  87 ++-
 arch/arm64/include/asm/kvm_arm.h |   3 +-
 arch/arm64/include/asm/kvm_asm.h |   3 +-
 arch/arm64/include/asm/kvm_mmu.h |  12 +++--
 arch/arm64/kvm/sys_regs.c| 107 ++-
 arch/arm64/kvm/sys_regs.h|   2 +
 17 files changed, 316 insertions(+), 62 deletions(-)

-- 
1.8.3.4



[GIT PULL] First round of KVM updates for 3.14

2014-01-22 Thread Paolo Bonzini
Linus,

The following changes since commit dc1ccc48159d63eca5089e507c82c7d22ef60839:

  Linux 3.13-rc2 (2013-11-29 12:57:14 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to 7650b6870930055426abb32cc47d164ccdea49db:

  Merge branch 'kvm-urgent' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-queue 
(2014-01-20 13:00:02 +0100)



First round of KVM updates for 3.14; PPC parts will come next week.
Nothing major here, just bugfixes all over the place.  The most
interesting part is the ARM guys' virtualized interrupt controller
overhaul, which lets userspace get/set the state and thus enables
migration of ARM VMs.


Andre Przywara (1):
  ARM/KVM: save and restore generic timer registers

Anup Patel (4):
  KVM: Documentation: Fix typo for KVM_ARM_VCPU_INIT ioctl
  arm64: KVM: Add Kconfig option for max VCPUs per-Guest
  arm64: KVM: Support X-Gene guest VCPU on APM X-Gene host
  arm64: KVM: Force undefined exception for Guest SMC intructions

Chen Fan (1):
  KVM: x86: Fix debug typo error in lapic

Christian Borntraeger (1):
  KVM: s390: Fix memory access error detection

Christoffer Dall (12):
  arm: KVM: Don't return PSCI_INVAL if waitqueue is inactive
  arm/arm64: KVM: arch_timer: Initialize cntvoff at kvm_init
  ARM: KVM: Allow creating the VGIC after VCPUs
  KVM: arm-vgic: Support KVM_CREATE_DEVICE for VGIC
  KVM: arm-vgic: Set base addr through device API
  irqchip: arm-gic: Define additional MMIO offsets and masks
  KVM: arm-vgic: Make vgic mmio functions more generic
  arm/arm64: kvm: Set vcpu->cpu to -1 on vcpu_put
  KVM: arm-vgic: Add vgic reg access from dev attr
  KVM: arm-vgic: Support unqueueing of LRs to the dist
  KVM: arm-vgic: Add GICD_SPENDSGIR and GICD_CPENDSGIR handlers
  KVM: arm-vgic: Support CPU interface reg access

Cornelia Huck (1):
  KVM: s390: diagnose call documentation

Dominik Dingel (1):
  KVM: s390: ioeventfd: ignore leftmost bits

Gleb Natapov (2):
  KVM: Change maintainer email address
  KVM: VMX: shadow VM_(ENTRY|EXIT)_CONTROLS vmcs field

Heiko Carstens (1):
  KVM: s390: fix diagnose code extraction

Jan Kiszka (12):
  KVM: nVMX: Add support for activity state HLT
  KVM: nVMX: Support direct APIC access from L2
  KVM: VMX: Do not skip the instruction if handle_dr injects a fault
  KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
  KVM: SVM: Fix reading of DR6
  KVM: VMX: Fix DR6 update on #DB exception
  KVM: nVMX: Leave VMX mode on clearing of feature control MSR
  KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
  KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
  KVM: nVMX: Clean up handling of VMX-related MSRs
  KVM: nVMX: Fix nested_run_pending on activity state HLT
  KVM: nVMX: Update guest activity state field on L2 exits

Marc Zyngier (3):
  Merge tag 'vgic-migrate-for-marc' of 
git://git.linaro.org/people/christoffer.dall/linux-kvm-arm into kvm-arm64/next
  Merge branch 'kvm-arm64/for-3.14' into kvm-arm64/next
  arm/arm64: KVM: relax the requirements of VMA alignment for THP

Marcelo Tosatti (5):
  KVM: MMU: handle invalid root_hpa at __direct_map
  KVM: VMX: fix use after free of vmx->loaded_vmcs
  KVM: x86: handle invalid root_hpa everywhere
  KVM: x86: limit PIT timer frequency
  KVM: x86: fix tsc catchup issue with tsc scaling

Masanari Iida (1):
  KVM: doc: Fix typo in doc/virtual/kvm

Paolo Bonzini (6):
  Merge tag 'kvm-s390-20131128' of git://git.kernel.org/.../kvms390/linux 
into kvm-next
  Merge remote-tracking branch 'tip/x86/cpufeature' into kvm-next
  Merge tag 'kvm-s390-20131211' of git://git.kernel.org/.../kvms390/linux 
into kvm-next
  Merge tag 'kvm-arm-for-3.14' of 
git://git.linaro.org/people/christoffer.dall/linux-kvm-arm into kvm-queue
  KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
  Merge branch 'kvm-urgent' of git://git.kernel.org/.../kvms390/linux into 
kvm-queue

Paul Bolle (1):
  kvm: vfio: silence GCC warning

Qiaowei Ren (2):
  x86, cpufeature: Define the Intel MPX feature flag
  x86, xsave: Support eager-only xsave features, add MPX support

Randy Dunlap (1):
  kvm: make KVM_MMU_AUDIT help text more readable

Sachin Kamat (1):
  KVM: ARM: Remove duplicate include

Santosh Shilimkar (1):
  arm/arm64: kvm: Use virt_to_idmap instead of virt_to_phys for idmap 
mappings

Scott Wood (1):
  kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub

Stephen Hemminger (2):
  kvm: make local functions static
  kvm: remove dead code

Takuya Yoshikawa (2):
  KVM: Use cond_resched() directly and remove useless kvm_resched()
  KVM: x86: 

Re: [BUG] sched: tip/master show soft lockup while running multiple VM

2014-01-22 Thread Paolo Bonzini

On 22/01/2014 13:36, Peter Zijlstra wrote:

On Wed, Jan 22, 2014 at 04:27:45PM +0800, Michael wang wrote:

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set


Could you try the patch here:

  lkml.kernel.org/r/20140122102435.gh31...@twins.programming.kicks-ass.net

I suspect it's the same issue.



Yes, looks like the same thing.  Thanks Peter.

Paolo


Re: [PATCH] kvm: print suberror on all internal errors

2014-01-22 Thread Paolo Bonzini

On 21/01/2014 18:11, Radim Krčmář wrote:

KVM introduced internal error exit reason and suberror at the same time,
and later extended it with internal error data.
QEMU does not report suberror on hosts between these two events because
we check for the extension. (half a year in 2009, but it is misleading)

Fix by removing KVM_CAP_INTERNAL_ERROR_DATA condition on printf.

(partially improved by bb44e0d12df70 and ba4047cf848a3 in the past)

Signed-off-by: Radim Krčmář 
---
 kvm-all.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 0bfb060..0a91d8e 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1539,17 +1539,16 @@ static void kvm_handle_io(uint16_t port, void *data, 
int direction, int size,

 static int kvm_handle_internal_error(CPUState *cpu, struct kvm_run *run)
 {
-fprintf(stderr, "KVM internal error.");
+fprintf(stderr, "KVM internal error. Suberror: %d\n",
+run->internal.suberror);
+
 if (kvm_check_extension(kvm_state, KVM_CAP_INTERNAL_ERROR_DATA)) {
 int i;

-fprintf(stderr, " Suberror: %d\n", run->internal.suberror);
 for (i = 0; i < run->internal.ndata; ++i) {
 fprintf(stderr, "extra data[%d]: %"PRIx64"\n",
 i, (uint64_t)run->internal.data[i]);
 }
-} else {
-fprintf(stderr, "\n");
 }
 if (run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
 fprintf(stderr, "emulation failure\n");



Applied to uq/master, thanks (please remember uq/master in the subject 
line in the future).


Paolo


Re: [BUG] sched: tip/master show soft lockup while running multiple VM

2014-01-22 Thread Peter Zijlstra
On Wed, Jan 22, 2014 at 04:27:45PM +0800, Michael wang wrote:
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set

Could you try the patch here:

  lkml.kernel.org/r/20140122102435.gh31...@twins.programming.kicks-ass.net

I suspect it's the same issue.


Re: [PATCH v3 0/4] X86/KVM: enable Intel MPX for KVM

2014-01-22 Thread Paolo Bonzini

On 22/01/2014 06:29, Liu, Jinsong wrote:

These patches are version 3 to enable Intel MPX for KVM.

Version 1:
  * Add some Intel MPX definitions
  * Fix a cpuid(0x0d, 0) exposing bug, dynamic per XCR0 features enable/disable
  * vmx and msr handle for MPX support at KVM
  * enable MPX feature for guest

Version 2:
  * remove generic MPX definitions; Qiaowei's patch has added them at the
kernel side
  * add MSR_IA32_BNDCFGS to msrs_to_save

Version 3:
  * rebase on latest kernel, which includes Qiaowei's MPX common definitions
pulled from HPA's tree


I am afraid there is still some work to do on these patches, so they 
need to be delayed to 3.15.


Patch 1:
this seems mostly separate from the rest of the MPX work.  I
commented on the missing "ULL" suffix, but I would also like to
understand why you put this patch in this series.

Patch 2:
As remarked in the reply to this patch:
- the vmx_disable_intercept_for_msr has to be unconditional
- you need a new kvm_x86_ops member mpx_supported, to disable
MPX whenever the two VMX controls are not available.

Patch 3:
this patch needs to be rebased.  Apart from that it is fine,
but please move the VMX bits together with patch 2, and the
other bits together with patch 4.

Patch 4:
this patch needs to be rebased and to use the new mpx_supported
member

If you also want to look at nested VMX support for MPX, that would be 
nice.  It should not be hard.  Otherwise we can take care of that later.


Thanks for your work,

Paolo


Re: [PATCH v3 2/4] KVM/X86: Intel MPX vmx and msr handle

2014-01-22 Thread Paolo Bonzini

On 22/01/2014 12:38, Paolo Bonzini wrote:

On 21/01/2014 20:01, Liu, Jinsong wrote:

From 31e68d752ac395dc6b65e6adf45be5324e92cdc8 Mon Sep 17 00:00:00 2001
From: Liu Jinsong 
Date: Fri, 13 Dec 2013 02:32:43 +0800
Subject: [PATCH v3 2/4] KVM/X86: Intel MPX vmx and msr handle

This patch handles the VMX and MSR parts of the Intel MPX feature.

Signed-off-by: Xudong Hao 
Signed-off-by: Liu Jinsong 
---
 arch/x86/include/asm/vmx.h|2 ++
 arch/x86/include/uapi/asm/msr-index.h |1 +
 arch/x86/kvm/vmx.c|   12 ++--
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 966502d..1bf4681 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -85,6 +85,7 @@
 #define VM_EXIT_SAVE_IA32_EFER  0x0010
 #define VM_EXIT_LOAD_IA32_EFER  0x0020
 #define VM_EXIT_SAVE_VMX_PREEMPTION_TIMER   0x0040
+#define VM_EXIT_CLEAR_BNDCFGS   0x0080

 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff

@@ -95,6 +96,7 @@
 #define VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL 0x2000
 #define VM_ENTRY_LOAD_IA32_PAT0x4000
 #define VM_ENTRY_LOAD_IA32_EFER 0x8000
+#define VM_ENTRY_LOAD_BNDCFGS   0x0001

 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x11ff

diff --git a/arch/x86/include/uapi/asm/msr-index.h
b/arch/x86/include/uapi/asm/msr-index.h
index 37813b5..2a418c4 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -294,6 +294,7 @@
 #define MSR_SMI_COUNT 0x0034
 #define MSR_IA32_FEATURE_CONTROL 0x003a
 #define MSR_IA32_TSC_ADJUST 0x003b
+#define MSR_IA32_BNDCFGS 0x0d90

 #define FEATURE_CONTROL_LOCKED (1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX (1<<1)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b2fe1c2..6d7d9ad 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -439,6 +439,7 @@ struct vcpu_vmx {
 #endif
 int   gs_ldt_reload_needed;
 int   fs_reload_needed;
+u64   msr_host_bndcfgs;
 } host_state;
 struct {
 int vm86_active;
@@ -1647,6 +1648,8 @@ static void vmx_save_host_state(struct kvm_vcpu
*vcpu)
 if (is_long_mode(&vmx->vcpu))
 wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
 #endif
+if (boot_cpu_has(X86_FEATURE_MPX))
+rdmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
 for (i = 0; i < vmx->save_nmsrs; ++i)
 kvm_set_shared_msr(vmx->guest_msrs[i].index,
vmx->guest_msrs[i].data,
@@ -1684,6 +1687,8 @@ static void __vmx_load_host_state(struct
vcpu_vmx *vmx)
 #ifdef CONFIG_X86_64
 wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
 #endif
+if (vmx->host_state.msr_host_bndcfgs)
+wrmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
 /*
  * If the FPU is not active (through the host task or
  * the guest vcpu), then restore the cr0.TS bit.
@@ -2800,7 +2805,7 @@ static __init int setup_vmcs_config(struct
vmcs_config *vmcs_conf)
 min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif
 opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
-VM_EXIT_ACK_INTR_ON_EXIT;
+VM_EXIT_ACK_INTR_ON_EXIT | VM_EXIT_CLEAR_BNDCFGS;
 if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
 &_vmexit_control) < 0)
 return -EIO;
@@ -2817,7 +2822,7 @@ static __init int setup_vmcs_config(struct
vmcs_config *vmcs_conf)
 _pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;

 min = 0;
-opt = VM_ENTRY_LOAD_IA32_PAT;
+opt = VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
 if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
 &_vmentry_control) < 0)
 return -EIO;


You need to disable MPX in the guest if the two controls are not
available.  You can do this, for example, in vmx_cpuid_update.


Better: add a mpx_supported field to struct kvm_x86_ops.  You can use 
invpcid_supported as a model.



Otherwise, nested VMX is broken.



@@ -8636,6 +8641,9 @@ static int __init vmx_init(void)
 vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false);
 vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
 vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
+if (boot_cpu_has(X86_FEATURE_MPX))
+vmx_disable_intercept_for_msr(MSR_IA32_BNDCFGS, true);


This needs to be done unconditionally.  Otherwise, reading/writing
BNDCFGS will access a nonexistent VMCS field.

Paolo


 memcpy(vmx_msr_bitmap_legacy_x2apic,
 vmx_msr_bitmap_legacy, PAGE_SIZE);
 memcpy(vmx_msr_bitmap_longmode_x2apic,






Re: [PATCH v3 2/4] KVM/X86: Intel MPX vmx and msr handle

2014-01-22 Thread Paolo Bonzini

On 21/01/2014 20:01, Liu, Jinsong wrote:

From 31e68d752ac395dc6b65e6adf45be5324e92cdc8 Mon Sep 17 00:00:00 2001
From: Liu Jinsong 
Date: Fri, 13 Dec 2013 02:32:43 +0800
Subject: [PATCH v3 2/4] KVM/X86: Intel MPX vmx and msr handle

This patch handles the VMX and MSR parts of the Intel MPX feature.

Signed-off-by: Xudong Hao 
Signed-off-by: Liu Jinsong 
---
 arch/x86/include/asm/vmx.h|2 ++
 arch/x86/include/uapi/asm/msr-index.h |1 +
 arch/x86/kvm/vmx.c|   12 ++--
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 966502d..1bf4681 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -85,6 +85,7 @@
 #define VM_EXIT_SAVE_IA32_EFER  0x0010
 #define VM_EXIT_LOAD_IA32_EFER  0x0020
 #define VM_EXIT_SAVE_VMX_PREEMPTION_TIMER   0x0040
+#define VM_EXIT_CLEAR_BNDCFGS   0x0080

 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR  0x00036dff

@@ -95,6 +96,7 @@
 #define VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL 0x2000
 #define VM_ENTRY_LOAD_IA32_PAT 0x4000
 #define VM_ENTRY_LOAD_IA32_EFER 0x8000
+#define VM_ENTRY_LOAD_BNDCFGS   0x0001

 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x11ff

diff --git a/arch/x86/include/uapi/asm/msr-index.h 
b/arch/x86/include/uapi/asm/msr-index.h
index 37813b5..2a418c4 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -294,6 +294,7 @@
 #define MSR_SMI_COUNT  0x0034
 #define MSR_IA32_FEATURE_CONTROL 0x003a
 #define MSR_IA32_TSC_ADJUST 0x003b
+#define MSR_IA32_BNDCFGS   0x0d90

 #define FEATURE_CONTROL_LOCKED (1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (1<<1)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b2fe1c2..6d7d9ad 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -439,6 +439,7 @@ struct vcpu_vmx {
 #endif
int   gs_ldt_reload_needed;
int   fs_reload_needed;
+   u64   msr_host_bndcfgs;
} host_state;
struct {
int vm86_active;
@@ -1647,6 +1648,8 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
if (is_long_mode(&vmx->vcpu))
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
 #endif
+   if (boot_cpu_has(X86_FEATURE_MPX))
+   rdmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
for (i = 0; i < vmx->save_nmsrs; ++i)
kvm_set_shared_msr(vmx->guest_msrs[i].index,
   vmx->guest_msrs[i].data,
@@ -1684,6 +1687,8 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 #ifdef CONFIG_X86_64
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
 #endif
+   if (vmx->host_state.msr_host_bndcfgs)
+   wrmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
/*
 * If the FPU is not active (through the host task or
 * the guest vcpu), then restore the cr0.TS bit.
@@ -2800,7 +2805,7 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif
opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
-   VM_EXIT_ACK_INTR_ON_EXIT;
+   VM_EXIT_ACK_INTR_ON_EXIT | VM_EXIT_CLEAR_BNDCFGS;
if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
&_vmexit_control) < 0)
return -EIO;
@@ -2817,7 +2822,7 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
_pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;

min = 0;
-   opt = VM_ENTRY_LOAD_IA32_PAT;
+   opt = VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
&_vmentry_control) < 0)
return -EIO;


You need to disable MPX in the guest if the two controls are not 
available.  You can do this, for example, in vmx_cpuid_update. 
Otherwise, nested VMX is broken.



@@ -8636,6 +8641,9 @@ static int __init vmx_init(void)
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
+   if (boot_cpu_has(X86_FEATURE_MPX))
+   vmx_disable_intercept_for_msr(MSR_IA32_BNDCFGS, true);


This needs to be done unconditionally.  Otherwise, reading/writing 
BNDCFGS will access a nonexistent VMCS field.


Paolo


memcpy(vmx_msr_bitmap_legacy_x2apic,
vmx_msr_bitmap_legacy, PAGE_SIZE);
memcpy(vmx_msr_bitmap_longmode_x2apic,




Re: [PATCH v3 1/4] KVM/X86: Fix xsave cpuid exposing bug

2014-01-22 Thread Paolo Bonzini

On 21/01/2014 19:59, Liu, Jinsong wrote:

From 3155a190ce6ebb213e6c724240f4e6620ba67a9d Mon Sep 17 00:00:00 2001
From: Liu Jinsong 
Date: Fri, 13 Dec 2013 02:32:03 +0800
Subject: [PATCH v3 1/4] KVM/X86: Fix xsave cpuid exposing bug

EBX of cpuid(0xD, 0) is dynamic per XCR0 features enable/disable.
Bit 63 of XCR0 is reserved for future expansion.

Signed-off-by: Liu Jinsong 
---
 arch/x86/include/asm/xsave.h |2 ++
 arch/x86/kvm/cpuid.c |6 +++---
 arch/x86/kvm/x86.c   |7 +--
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 5547389..f6c4e85 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -13,6 +13,8 @@
 #define XSTATE_BNDCSR  0x10

 #define XSTATE_FPSSE   (XSTATE_FP | XSTATE_SSE)
+/* Bit 63 of XCR0 is reserved for future expansion */
+#define XSTATE_EXTEND_MASK (~(XSTATE_FPSSE | (1 << 63)))


This should be 1ULL, not 1.  The compiler should have warned you.

Paolo



 #define FXSAVE_SIZE 512

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..2d661e6 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -28,7 +28,7 @@ static u32 xstate_required_size(u64 xstate_bv)
int feature_bit = 0;
u32 ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;

-   xstate_bv &= ~XSTATE_FPSSE;
+   xstate_bv &= XSTATE_EXTEND_MASK;
while (xstate_bv) {
if (xstate_bv & 0x1) {
u32 eax, ebx, ecx, edx;
@@ -74,8 +74,8 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu)
vcpu->arch.guest_supported_xcr0 =
(best->eax | ((u64)best->edx << 32)) &
host_xcr0 & KVM_SUPPORTED_XCR0;
-   vcpu->arch.guest_xstate_size =
-   xstate_required_size(vcpu->arch.guest_supported_xcr0);
+   vcpu->arch.guest_xstate_size = best->ebx =
+   xstate_required_size(vcpu->arch.xcr0);
}

kvm_pmu_cpuid_update(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 21ef1ba..1657ca2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -576,13 +576,13 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)

 int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
-   u64 xcr0;
+   u64 xcr0 = xcr;
+   u64 old_xcr0 = vcpu->arch.xcr0;
u64 valid_bits;

/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
if (index != XCR_XFEATURE_ENABLED_MASK)
return 1;
-   xcr0 = xcr;
if (!(xcr0 & XSTATE_FP))
return 1;
if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE))
@@ -599,6 +599,9 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)

kvm_put_guest_xcr0(vcpu);
vcpu->arch.xcr0 = xcr0;
+
+   if ((xcr0 ^ old_xcr0) & XSTATE_EXTEND_MASK)
+   kvm_update_cpuid(vcpu);
return 0;
 }






Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-22 Thread Alexander Graf

On 22.01.2014, at 08:26, Victor Kamensky  wrote:

> On 21 January 2014 22:41, Alexander Graf  wrote:
>> 
>> 
>> "Native endian" really is just a shortcut for "target endian"
>> which is LE for ARM and BE for PPC. There shouldn't be
>> a qemu-system-armeb or qemu-system-ppc64le.
> 
> I disagree. Fully functional ARM BE system is what we've
> been working on for last few months. 'We' is Linaro
> Networking Group, Endian subteam and some other guys
> in ARM and across community. Why we do that is a bit
> beyond of this discussion.
> 
> ARM BE patches for both V7 and V8 are already in mainline
> kernel. But ARM BE KVM host is broken now. It is known
> deficiency that I am trying to fix. Please look at [1]. Patches
> for V7 BE KVM were proposed and currently under active
> discussion. Currently I work on ARM V8 BE KVM changes.
> 
> So "native endian" in ARM is value of CPSR register E bit.
> If it is off native endian is LE, if it is on it is BE.
> 
> Once and if we agree on ARM BE KVM host changes, the
> next step would be patches in qemu one of which introduces
> qemu-system-armeb. Please see [2].

I think we're facing an ideology conflict here. Yes, there should be a 
qemu-system-arm that is BE capable. There should also be a qemu-system-ppc64 
that is LE capable. But there is no point in changing the "default endianness" 
for the virtual CPUs that we plug in there. Both CPUs are perfectly capable of 
running in LE or BE mode, the question is just what we declare the "default".

Think about the PPC bootstrap. We start off with a BE firmware, then boot into 
the Linux kernel which calls a hypercall to set the LE bit on every interrupt. 
But there's no reason this little endian kernel couldn't theoretically have big 
endian user space running with access to emulated device registers.

As Peter already pointed out, the actual breakage behind this is that we have a 
"default endianness" at all. But that's a very difficult thing to resolve and I 
don't think should be our primary goal. Just live with the fact that we declare 
ARM little endian in QEMU and swap things accordingly - then everyone's happy.

This really only ever becomes a problem if you have devices that have awareness 
of the CPUs endian mode. The only one on PPC that I'm aware of that falls into 
this category is virtio and there are patches pending to solve that. I don't 
know if there are any QEMU emulated devices outside of virtio with this issue 
on ARM, but you'll have to make the emulation code for those look at the CPU 
state then.

> 
>> QEMU emulates everything that comes after the CPU, so
>> imagine the ioctl struct as a bus package. Your bus
>> doesn't care what endianness the CPU is in - it just
>> gets data from the CPU.
> 
> I am not sure that I follow above. Suppose I have
> 
> move r1, #1
> str r1, [r0]
> 
> where r0 is device address. Now depending on CPSR
> E bit value device address will receive 1 as integer either
> in LE order or in BE order. That is how ARM v7 CPU
> works, regardless whether it is emulated or not.
> 
> So if E bit is off (LE case) after str is executed
> byte at r0 address will get 1
> byte at r0 + 1 address will get 0
> byte at r0 + 2 address will get 0
> byte at r0 + 3 address will get 0
> 
> If E bit is on (BE case) after str is executed
> byte at r0 address will get 0
> byte at r0 + 1 address will get 0
> byte at r0 + 2 address will get 0
> byte at r0 + 3 address will get 1
> 
> my point that mmio.data[] just carries bytes for phys_addr
> mmio.data[0] would be value for byte at phys_addr,
> mmio.data[1] would be value for byte at phys_addr + 1, and
> so on.

What we get is an instruction that traps because it wants to "write r1 (which 
has value=1) into address x". So at that point we get the register value.

Then we need to take a look at the E bit to see whether the write was supposed 
to be in non-host endianness because we need to emulate exactly the LE/BE 
difference you're indicating above. The way we implement this on PPC is that we 
simply byte swap the register value when guest_endian != host_endian.
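That swap can be sketched in a few lines. This is an illustrative user-space sketch, not the actual KVM code; the helper names (fill_mmio_data, bswap32) are made up for the example, and real kernels would use swab32()/cpu_to_le32():

```c
#include <stdint.h>
#include <string.h>

/* Byte-swap a 32-bit value (kernels use swab32() for this). */
static uint32_t bswap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) <<  8) |
           ((v & 0x00ff0000u) >>  8) |
           ((v & 0xff000000u) >> 24);
}

/* Hypothetical: fill mmio.data[] for a 32-bit guest store.
 * If the guest's current endianness differs from the host's,
 * swap the register value first, then memcpy it as usual. */
static void fill_mmio_data(uint8_t data[8], uint32_t reg,
                           int guest_is_be, int host_is_be)
{
    if (guest_is_be != host_is_be)
        reg = bswap32(reg);
    memcpy(data, &reg, sizeof(reg));
}
```

With this convention, user space can memcpy() the array back into a local variable and always see a value in host-native order, whatever mode the guest CPU was in.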

With this in place, QEMU can just memcpy() the value into a local register and 
feed it into its emulation code which expects a "register value as if the CPU 
was running in native endianness" as parameter - with "native" meaning "little 
endian" for qemu-system-arm. Device emulation code doesn't know what to do with 
a byte array.

Take a look at QEMU's MMIO handler:

    case KVM_EXIT_MMIO:
        DPRINTF("handle_mmio\n");
        cpu_physical_memory_rw(run->mmio.phys_addr,
                               run->mmio.data,
                               run->mmio.len,
                               run->mmio.is_write);
        ret = 0;
        break;

which translates to

    switch (l) {
    case 8:
        /* 64 bit write access */
        val = ldq_p(buf);
        error |= io_mem_write(mr, addr1, val, 8);
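For a little-endian target, ldq_p boils down to assembling the 8 bytes in little-endian order. A simplified sketch of that load (not QEMU's actual macro expansion, which dispatches on target endianness) is:

```c
#include <stdint.h>

/* Simplified little-endian 64-bit load, as QEMU's ldq_le_p
 * effectively does: byte 0 is the least significant byte. */
static uint64_t ldq_le(const uint8_t *buf)
{
    uint64_t val = 0;
    for (int i = 7; i >= 0; i--)
        val = (val << 8) | buf[i];
    return val;
}
```

So whatever KVM put into mmio.data[] is reinterpreted here as a sized value in the target's "native" byte order before being handed to the device emulation.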

Re: [PATCH v3 1/4] KVM/X86: Fix xsave cpuid exposing bug

2014-01-22 Thread H. Peter Anvin
On 01/22/2014 02:21 AM, Paolo Bonzini wrote:
> Il 21/01/2014 19:59, Liu, Jinsong ha scritto:
>> From 3155a190ce6ebb213e6c724240f4e6620ba67a9d Mon Sep 17 00:00:00 2001
>> From: Liu Jinsong 
>> Date: Fri, 13 Dec 2013 02:32:03 +0800
>> Subject: [PATCH v3 1/4] KVM/X86: Fix xsave cpuid exposing bug
>>
>> EBX of cpuid(0xD, 0) is dynamic per XCR0 features enable/disable.
>> Bit 63 of XCR0 is reserved for future expansion.
>>
>> Signed-off-by: Liu Jinsong 
> 
> Peter, can I have your acked-by on this?
> 

Yes.

Acked-by: H. Peter Anvin 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/4] KVM/X86: Fix xsave cpuid exposing bug

2014-01-22 Thread Paolo Bonzini

Il 21/01/2014 19:59, Liu, Jinsong ha scritto:

From 3155a190ce6ebb213e6c724240f4e6620ba67a9d Mon Sep 17 00:00:00 2001
From: Liu Jinsong 
Date: Fri, 13 Dec 2013 02:32:03 +0800
Subject: [PATCH v3 1/4] KVM/X86: Fix xsave cpuid exposing bug

EBX of cpuid(0xD, 0) is dynamic per XCR0 features enable/disable.
Bit 63 of XCR0 is reserved for future expansion.

Signed-off-by: Liu Jinsong 


Peter, can I have your acked-by on this?


---
 arch/x86/include/asm/xsave.h | 2 ++
 arch/x86/kvm/cpuid.c         | 6 +++---
 arch/x86/kvm/x86.c           | 7 +++++--
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index 5547389..f6c4e85 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -13,6 +13,8 @@
 #define XSTATE_BNDCSR	0x10
 
 #define XSTATE_FPSSE	(XSTATE_FP | XSTATE_SSE)
+/* Bit 63 of XCR0 is reserved for future expansion */
+#define XSTATE_EXTEND_MASK	(~(XSTATE_FPSSE | (1ULL << 63)))
 
 #define FXSAVE_SIZE	512

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..2d661e6 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -28,7 +28,7 @@ static u32 xstate_required_size(u64 xstate_bv)
 	int feature_bit = 0;
 	u32 ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;
 
-	xstate_bv &= ~XSTATE_FPSSE;
+	xstate_bv &= XSTATE_EXTEND_MASK;
 	while (xstate_bv) {
 		if (xstate_bv & 0x1) {
 			u32 eax, ebx, ecx, edx;
@@ -74,8 +74,8 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu)
 		vcpu->arch.guest_supported_xcr0 =
 			(best->eax | ((u64)best->edx << 32)) &
 			host_xcr0 & KVM_SUPPORTED_XCR0;
-		vcpu->arch.guest_xstate_size =
-			xstate_required_size(vcpu->arch.guest_supported_xcr0);
+		vcpu->arch.guest_xstate_size = best->ebx =
+			xstate_required_size(vcpu->arch.xcr0);
 	}
 
 	kvm_pmu_cpuid_update(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 21ef1ba..1657ca2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -576,13 +576,13 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
 
 int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
-	u64 xcr0;
+	u64 xcr0 = xcr;
+	u64 old_xcr0 = vcpu->arch.xcr0;
 	u64 valid_bits;
 
 	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
 	if (index != XCR_XFEATURE_ENABLED_MASK)
 		return 1;
-	xcr0 = xcr;
 	if (!(xcr0 & XSTATE_FP))
 		return 1;
 	if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE))
@@ -599,6 +599,9 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 
 	kvm_put_guest_xcr0(vcpu);
 	vcpu->arch.xcr0 = xcr0;
+
+	if ((xcr0 ^ old_xcr0) & XSTATE_EXTEND_MASK)
+		kvm_update_cpuid(vcpu);
 	return 0;
 }
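The size recomputation the patch triggers can be illustrated in user space. The following is a sketch mirroring the loop in xstate_required_size(), with the CPUID(0xD, n) sub-leaf queries replaced by a stubbed lookup table (the struct name and table values are purely illustrative; real code executes the CPUID instruction):

```c
#include <stdint.h>

#define XSAVE_HDR_SIZE    64
#define XSAVE_HDR_OFFSET  512   /* legacy FXSAVE region precedes the header */

/* Stand-in for CPUID(0xD, feature_bit): size and offset of one
 * extended state component within the XSAVE area. */
struct xcomp {
    uint32_t size;    /* EAX of the sub-leaf */
    uint32_t offset;  /* EBX of the sub-leaf */
};

/* Required XSAVE area size for the enabled components in xstate_bv:
 * the end of the highest-placed enabled component, or the header end
 * if only legacy x87/SSE state is enabled. */
static uint32_t xstate_required_size(uint64_t xstate_bv,
                                     const struct xcomp *comp)
{
    int feature_bit = 0;
    uint32_t ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;

    xstate_bv &= ~3ULL;  /* bits 0-1 (x87/SSE) live in the legacy region */
    while (xstate_bv) {
        if (xstate_bv & 1) {
            uint32_t end = comp[feature_bit].offset + comp[feature_bit].size;
            if (end > ret)
                ret = end;
        }
        xstate_bv >>= 1;
        feature_bit++;
    }
    return ret;
}
```

This is why EBX of CPUID(0xD, 0) is dynamic: toggling an extended-state bit in XCR0 moves the end of the save area, so the guest's CPUID must be refreshed whenever XSTATE_EXTEND_MASK bits change.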






Re: KVM and variable-endianness guest CPUs

2014-01-22 Thread Peter Maydell
On 22 January 2014 05:39, Victor Kamensky  wrote:
> Hi Guys,
>
> Christoffer and I had a bit of a heated chat :) on this
> subject last night. Christoffer, really appreciate
> your time! We did not really reach agreement
> during the chat, and Christoffer asked me to follow
> up on this thread.
> Here it goes. Sorry, it is a very long email.
>
> I don't believe we can assign any endianity to the
> mmio.data[] byte array. I believe mmio.data[] and
> mmio.len act just like memcpy, and that is all. As
> memcpy does not imply any endianity of the underlying
> data, mmio.data[] should not either.

This email is about five times too long to be actually
useful, but the major issue here is that the data being
transferred is not just a bag of bytes. The data[]
array plus the size field are being (mis)used to indicate
that the memory transaction is one of:
 * an 8 bit access
 * a 16 bit access of some uint16_t value
 * a 32 bit access of some uint32_t value
 * a 64 bit access of some uint64_t value

exactly as a CPU hardware bus would do. It's
because the API is defined in this awkward way with
a uint8_t[] array that we need to specify how both
sides should go from the actual properties of the
memory transaction (value and size) to filling in the
array.
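Whichever convention the ABI picks, both sides of the interface must apply it symmetrically when going between (value, size) and the byte array. A minimal sketch of one such convention, with the value laid into the array in host byte order via memcpy (the helper names are hypothetical, not part of the KVM API), is:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical encode/decode pair for a sized MMIO transaction.
 * Kernel and user space must agree on the same layout; here the
 * value occupies the first 'len' bytes in host byte order. */
static void mmio_encode(uint8_t data[8], uint64_t value, uint32_t len)
{
    memcpy(data, &value, len);      /* len must be 1, 2, 4 or 8 */
}

static uint64_t mmio_decode(const uint8_t data[8], uint32_t len)
{
    uint64_t value = 0;
    memcpy(&value, data, len);
    return value;
}
```

The round-trip is endian-agnostic on any one host; the whole argument in this thread is about what the layout must be when kernel and user space (or guest and host) disagree about byte order.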

Furthermore, device endianness is entirely irrelevant
for deciding the properties of mmio.data[], because the
thing we're modelling here is essentially the CPU->bus
interface. In real hardware, the properties of individual
devices on the bus are irrelevant to how the CPU's
interface to the bus behaves, and similarly here the
properties of emulated devices don't affect how KVM's
interface to QEMU userspace needs to work.

MemoryRegion's 'endianness' field, incidentally, is
a dreadful mess that we should get rid of. It is attempting
to model the property that some buses/bridges have of
doing byte-lane-swaps on data that passes through as
a property of the device itself. It would be better if we
modelled it properly, with container regions having possible
byte-swapping and devices just being devices.

thanks
-- PMM


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-22 Thread Anup Patel
Hi Alex,

On Wed, Jan 22, 2014 at 12:11 PM, Alexander Graf  wrote:
>
>
>> Am 22.01.2014 um 07:31 schrieb Anup Patel :
>>
>> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
>>  wrote:
>>> Hi Guys,
>>>
>>> Christoffer and I had a bit of a heated chat :) on this
>>> subject last night. Christoffer, really appreciate
>>> your time! We did not really reach agreement
>>> during the chat, and Christoffer asked me to follow
>>> up on this thread.
>>> Here it goes. Sorry, it is a very long email.
>>>
>>> I don't believe we can assign any endianity to the
>>> mmio.data[] byte array. I believe mmio.data[] and
>>> mmio.len act just like memcpy, and that is all. As
>>> memcpy does not imply any endianity of the underlying
>>> data, mmio.data[] should not either.
>>>
>>> Here is my definition:
>>>
>>> mmio.data[] is array of bytes that contains memory
>>> bytes in such form, for read case, that if those
>>> bytes are placed in guest memory and guest executes
>>> the same read access instruction with address to this
>>> memory, result would be the same as real h/w device
>>> memory access. Rest of KVM host and hypervisor
>>> part of code should really take care of mmio.data[]
>>> memory so it will be delivered to vcpu registers and
>>> restored by hypervisor part in such way that guest CPU
>>> register value is the same as it would be for real
>>> non-emulated h/w read access (that is emulation part).
>>> The same goes for write access, if guest writes into
>>> memory and those bytes are just copied to emulated
>>> h/w register it would have the same effect as real
>>> mapped h/w register write.
>>>
>>> In shorter form, i.e for len=4 access: endianity of integer
>>> at &mmio.data[0] address should match endianity
>>> of emulated h/w device behind phys_addr address,
>>> regardless what is endianity of emulator, KVM host,
>>> hypervisor, and guest
>>>
>>> Examples that illustrate my definition
>>> --
>>>
>>> 1) LE guest (E bit is off in ARM speak) reads integer
>>> (4 bytes) from mapped h/w LE device register -
>>> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>>>
>>> 2) BE guest (E bit is on in ARM speak) reads integer
>>> from mapped h/w LE device register - mmio.data[3]
>>> contains MSB, mmio.data[0] contains LSB. Note that
>>> if &mmio.data[0] memory would be placed in guest
>>> address space and instruction restarted with new
>>> address, then it would meet BE guest expectations
>>> - the guest knows that it reads LE h/w so it will byteswap
>>> register before processing it further. This is BE guest ARM
>>> case (regardless of what KVM host endianity is).
>>>
>>> 3) BE guest reads integer from mapped h/w BE device
>>> register - mmio.data[0] contains MSB, mmio.data[3]
>>> contains LSB. Note that if &mmio.data[0] memory would
>>> be placed in guest address space and instruction
>>> restarted with new address, then it would meet BE
>>> guest expectation - the guest knows that it reads
>>> BE h/w so it will proceed further without any other
>>> work. I guess, it is BE ppc case.
>>>
>>>
>>> Arguments in favor of memcpy semantics of mmio.data[]
>>> --
>>>
>>> x) What are possible values of 'len'? Previous discussions
>>> imply that is always powers of 2. Why is that? Maybe
>>> there will be CPU that would need to do 5 bytes mmio
>>> access, or 6 bytes. How do you assign endianity to
>>> such case? 'len' 5 or 6, or any works fine with
>>> memcpy semantics. I admit it is hypothetical case, but
>>> IMHO it tests how clean ABI definition is.
>>>
>>> x) Byte array does not have endianity because it
>>> does not have any structure. If one would want to
>>> imply structure why mmio is not defined in such way
>>> so structure reflected in mmio definition?
>>> Something like:
>>>
>>>
>>>/* KVM_EXIT_MMIO */
>>>struct {
>>>  __u64 phys_addr;
>>>  union {
>>>   __u8 byte;
>>>   __u16 hword;
>>>   __u32 word;
>>>   __u64 dword;
>>>  }  data;
>>>  __u32 len;
>>>  __u8  is_write;
>>>} mmio;
>>>
>>> where len is really serves as union discriminator and
>>> only allowed len values are 1, 2, 4, 8.
>>> In this case, I agree, endianity of integer types
>>> should be defined. I believe, use of byte array strongly
>>> implies that original intent was to have semantics of
>>> byte stream copy, just like memcpy does.
>>>
>>> x) Note there is nothing wrong with user kernel ABI to
>>> use just bytes stream as parameter. There is already
>>> precedents like 'read' and 'write' system calls :).
>>>
>>> x) Consider case when KVM works with emulated memory mapped
>>> h/w devices where some devices operate in LE mode and others
>>> operate in BE mode. It is defined by semantics of real h/w
>>> device whic