[Bug?] qemu abort when trying to passthrough BCM5719 Gigabit Ethernet

2014-10-10 Thread zhanghailiang

Hi all,

When I try to pass through a BCM5719 Gigabit Ethernet device to a guest using
the QEMU master branch, QEMU aborts with:
kvm_set_phys_mem: error registering slot: Bad address.

qemu command:
#./qemu/qemu/x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 \
    -vnc :99 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x3 \
    -drive file=/home/suse11_sp3_64,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,aio=native \
    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 \
    -device pci-assign,host=01:00.1,id=mydevice -net none

info about guest and host:
host OS: 3.16.5
*guest OS: Novell SuSE Linux Enterprise Server 11 SP3*
#cat /proc/cpuinfo
processor   : 31
vendor_id   : GenuineIntel
cpu family  : 6
model   : 62
model name  : Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
stepping: 4
microcode   : 0x416
cpu MHz : 1926.875
cache size  : 20480 KB
physical id : 1
siblings: 16
core id : 7
cpu cores   : 8
apicid  : 47
initial apicid  : 47
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx
smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln
pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips: 4005.35
clflush size: 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

gdb info:
(gdb) bt
#0  0x71ce9989 in raise () from /usr/lib64/libc.so.6
#1  0x71ceb098 in abort () from /usr/lib64/libc.so.6
#2  0x556275cf in kvm_set_phys_mem (section=0x7fffedcea790, add=true) at /home/qemu/qemu/kvm-all.c:711
#3  0x5562980f in address_space_update_topology_pass (as=as@entry=0x55d01ca0 , adding=adding@entry=true, new_view=, new_view=, old_view=0x7fffe8022a90, old_view=0x7fffe8022a90) at /home/qemu/qemu/memory.c:752
#4  0x5562b910 in address_space_update_topology (as=0x55d01ca0 ) at /home/qemu/qemu/memory.c:767
#5  memory_region_transaction_commit () at /home/qemu/qemu/memory.c:808
#6  0x557a75b4 in pci_update_mappings (d=0x562ba9f0) at hw/pci/pci.c:1113
#7  0x557a7932 in pci_default_write_config (d=d@entry=0x562ba9f0, addr=addr@entry=20, val_in=val_in@entry=4294967295, l=l@entry=4) at hw/pci/pci.c:1165
#8  0x5566c17e in assigned_dev_pci_write_config (pci_dev=0x562ba9f0, address=20, val=4294967295, len=4) at /home/qemu/qemu/hw/i386/kvm/pci-assign.c:1196
#9  0x55628fea in access_with_adjusted_size (addr=addr@entry=0, value=value@entry=0x7fffedceaae0, size=size@entry=4, access_size_min=, access_size_max=, access=0x55629160 , mr=0x56231f00) at /home/qemu/qemu/memory.c:480
#10 0x5562dbf7 in memory_region_dispatch_write (size=4, data=18446744073709551615, addr=0, mr=0x56231f00) at /home/qemu/qemu/memory.c:1122
#11 io_mem_write (mr=mr@entry=0x56231f00, addr=0, val=, size=4) at /home/qemu/qemu/memory.c:1958
#12 0x555f8963 in address_space_rw (as=0x55d01d80 , addr=addr@entry=3324, buf=0x77fec000 "\377\377\377\377", len=len@entry=4, is_write=is_write@entry=true) at /home/qemu/qemu/exec.c:2145
#13 0x55628491 in kvm_handle_io (count=1, size=4, direction=, data=, port=3324) at /home/qemu/qemu/kvm-all.c:1614
#14 kvm_cpu_exec (cpu=cpu@entry=0x5620e610) at /home/qemu/qemu/kvm-all.c:1771
#15 0x55617182 in qemu_kvm_cpu_thread_fn (arg=0x5620e610) at /home/qemu/qemu/cpus.c:953
#16 0x76ba2df3 in start_thread () from /usr/lib64/libpthread.so.0
#17 0x71daa3dd in clone () from /usr/lib64/libc.so.6

messages log:
Oct 10 07:43:18 localhost kernel: kvm: zapping shadow pages for mmio generation wraparound
Oct 10 07:43:27 localhost kernel: kvm [13251]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xabcd
Oct 10 07:43:28 localhost kernel: intel_iommu_map: iommu width (48) is not sufficient for the mapped address (fe001000)
Oct 10 07:43:28 localhost kernel: kvm_iommu_map_address:iommu failed to map pfn=94880

After bisecting the commits, I found that the following commit leads to this
problem.

commit 57271d63c4d93352406704d540453c43a4a241a7
Author: Paolo Bonzini 
Date:   Thu Nov 7 17:14:37 2013 +0100

exec: make address spaces 64-bit wide

As an alternative to commit 818f86b (exec: limit system memory
size, 2013-11-04) let's just make all address spaces 64-bit wide.
This eliminates problems with phys_page_find ignoring bits above
TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
consequently

Re: [PATCH] Documentation: virtual: kvm: correct one bit description in APF case

2014-10-10 Thread Paolo Bonzini
Il 11/10/2014 03:19, Tiejun Chen ha scritto:
> When commit 6adba5274206 (KVM: Let host know whether the guest can
> handle async PF in non-userspace context.) is introduced, actually
> bit 2 still is reserved and should be zero. Instead, bit 1 is 1 to
> indicate if asynchronous page faults can be injected when vcpu is
> in cpl == 0, and also please see this,
> 
> in the file kvm_para.h, #define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1).
> 
> Signed-off-by: Tiejun Chen 
> ---
>  Documentation/virtual/kvm/msr.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
> index 6d470ae..2a71c8f 100644
> --- a/Documentation/virtual/kvm/msr.txt
> +++ b/Documentation/virtual/kvm/msr.txt
> @@ -168,7 +168,7 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
>   64 byte memory area which must be in guest RAM and must be
>   zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
>   when asynchronous page faults are enabled on the vcpu 0 when
> - disabled. Bit 2 is 1 if asynchronous page faults can be injected
> + disabled. Bit 1 is 1 if asynchronous page faults can be injected
>   when vcpu is in cpl == 0.
>  
>   First 4 byte of 64 byte memory location will be written to by
> 

Thanks, applying locally.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Bug: host crash when a VF is passed through to a VM

2014-10-10 Thread ChenLiang
Hi all:

kernel: 3.0.93-0.8-default
qemu: 1.5

crash log:

[134397.708857] BUG: unable to handle kernel NULL pointer dereference at 0012
[134397.717334] IP: [] iommu_disable_dev_iotlb+0x15/0x30
[134397.724268] PGD 0
[134397.726686] Oops:  [#1] SMP
[134397.730335] kbox: Begin to handle event info
[134397.734992] kbox: kbox: Enter into handle die dump while current state:Dump State Init
[134397.751054]
[134398.043275] kbox: End  handling event info
[134398.047757] CPU 1
[134398.049678] Modules linked in: mlx4_en(FX) mlx4_core(FX) compat(FX) openvswitch crc32c libcrc32c gre nm_dev(FN) ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables uvpdump i
[134398.125560] Supported: No, Unsupported modules are loaded
[134398.131338]
[134398.133220] Pid: 183980, comm: qemu-kvm Tainted: GF  NX 3.0.93-0.8-default #1 HUAWEI TECHNOLOGIES CO.,LTD. CH80TXSUA/CH80TXSUA
[134398.146013] RIP: 0010:[]  [] iommu_disable_dev_iotlb+0x15/0x30
[134398.155649] RSP: 0018:8817f0e13be0  EFLAGS: 00010202
[134398.161342] RAX: 0002 RBX: 880bf34e6600 RCX: 0007
[134398.169158] RDX: 880bf34e6650 RSI: 0292 RDI: 880bf2011000
[134398.176970] RBP:  R08: dead00200200 R09: dead00100100
[134398.184793] R10: 8817f5d4c380 R11: 8128e350 R12: 880bf34e6640
[134398.192549] R13: 880bf5d3de80 R14: 880653e4a000 R15: 880bf34d7858
[134398.200360] FS:  7f6d17fff980() GS:880c3ee2() knlGS:
[134398.209109] CS:  0010 DS:  ES:  CR0: 8005003b
[134398.215223] CR2: 0012 CR3: 00096ed95000 CR4: 001427e0
[134398.223000] DR0:  DR1:  DR2: 
[134398.230812] DR3:  DR6: 0ff0 DR7: 0400
[134398.238630] Process qemu-kvm (pid: 183980, threadinfo 8817f0e12000, task 8817804ac300)
[134398.247906] Stack:
[134398.250260]  8128f756 ffed 880bf34d7840 0292
[134398.258383]  880bf34d7640 880653e4a090 ffed 880653e4a000
[134398.266509]  880653e4a000 880bee6c4878 812938c6 880bebe17d60
[134398.274659] Call Trace:
[134398.277503]  [] domain_remove_one_dev_info+0x156/0x280
[134398.284582]  [] intel_iommu_attach_device+0x156/0x170
[134398.291583]  [] kvm_assign_device+0x73/0x150 [kvm]
[134398.298360]  [] kvm_vm_ioctl_assign_device+0x247/0x3c0 [kvm]
[134398.306303]  [] kvm_vm_ioctl_assigned_device+0x2fc/0x6a0 [kvm]
[134398.314401]  [] kvm_vm_ioctl+0x101/0x300 [kvm]
[134398.320789]  [] do_vfs_ioctl+0x8b/0x3b0
[134398.326566]  [] sys_ioctl+0xa1/0xb0
[134398.331996]  [] system_call_fastpath+0x16/0x1b
[134398.338376]  [<7f6d15b77e57>] 0x7f6d15b77e56
[134398.343373] Code: c6 05 97 dc de 00 01 eb a7 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 7f 28 48 85 ff 74 12 48 8b 87 30 09 00 00 48 85 c0 74 06 40 10 01 75 05 f3 c3 0f 1f 00 e9 ab 61 00 00 66 66
[134398.364424] RIP  [] iommu_disable_dev_iotlb+0x15/0x30
[134398.371441]  RSP
[134398.375314] CR2: 0012



[PATCH] Documentation: virtual: kvm: correct one bit description in APF case

2014-10-10 Thread Tiejun Chen
When commit 6adba5274206 (KVM: Let host know whether the guest can
handle async PF in non-userspace context.) was introduced, bit 2
actually remained reserved and should be zero. Instead, bit 1 is 1 to
indicate that asynchronous page faults can be injected when the vcpu
is in cpl == 0; see also, in the file kvm_para.h:

#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)

Signed-off-by: Tiejun Chen 
---
 Documentation/virtual/kvm/msr.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 6d470ae..2a71c8f 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -168,7 +168,7 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
64 byte memory area which must be in guest RAM and must be
zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
when asynchronous page faults are enabled on the vcpu 0 when
-   disabled. Bit 2 is 1 if asynchronous page faults can be injected
+   disabled. Bit 1 is 1 if asynchronous page faults can be injected
when vcpu is in cpl == 0.
 
First 4 byte of 64 byte memory location will be written to by
-- 
1.9.1



Re: [PATCH v3 3/3] arm64: Allow 48-bits VA space without ARM_SMMU

2014-10-10 Thread Christoffer Dall
On Fri, Oct 10, 2014 at 04:10:53PM +0100, Catalin Marinas wrote:
> On Fri, Oct 10, 2014 at 11:14:30AM +0100, Christoffer Dall wrote:
> > Now when KVM has been reworked to support 48-bits host VA space, we can
> > allow systems to be configured with this option.  However, the ARM SMMU
> > driver also needs to be tweaked for 48-bit support so only allow the
> > config option to be set when not including support for the SMMU.
> > 
> > Signed-off-by: Christoffer Dall 
> > ---
> >  arch/arm64/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index fd4e81a..a76c6c3b 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -200,7 +200,7 @@ config ARM64_VA_BITS_42
> >  
> >  config ARM64_VA_BITS_48
> > bool "48-bit"
> > -   depends on BROKEN
> > +   depends on !ARM_SMMU
> >  
> >  endchoice
> 
> I think we should rather merge this separately via the arm64 tree once
> we test 48-bit VA some more (and as you noticed, there is a bug
> already).
> 
Sounds like a good idea, I'll apply the other two to kvmarm.

Thanks for the reviews!

-Christoffer


Re: [PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows

2014-10-10 Thread Alexey Kardashevskiy
On 09/23/2014 11:56 PM, Alex Williamson wrote:
> On Tue, 2014-09-23 at 13:01 +1000, Alexey Kardashevskiy wrote:
>> This defines and implements VFIO IOMMU API which lets the userspace
>> create and remove DMA windows.
>>
>> This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of
>> available windows and page mask.
>>
>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE
>> to allow the user space to create and remove window(s).
>>
>> The VFIO IOMMU driver does basic sanity checks and calls corresponding
>> SPAPR TCE functions. At the moment only IODA2 (POWER8 PCI host bridge)
>> implements them.
>>
>> This advertises VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via
>> VFIO_IOMMU_SPAPR_TCE_GET_INFO.
>>
>> This calls platform DDW reset() callback when IOMMU is being disabled
>> to reset the DMA configuration to its original state.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++--
>>  include/uapi/linux/vfio.h   |  25 ++-
>>  2 files changed, 153 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 0dccbc4..b518891 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container *container)
>>  
>>  container->enabled = false;
>>  
>> -if (!container->grp || !current->mm)
>> +if (!container->grp)
>>  return;
>>  
>>  data = iommu_group_get_iommudata(container->grp);
>>  if (!data || !data->iommu_owner || !data->ops->get_table)
>>  return;
>>  
>> -tbl = data->ops->get_table(data, 0);
>> -if (!tbl)
>> -return;
>> +if (current->mm) {
>> +tbl = data->ops->get_table(data, 0);
>> +if (tbl)
>> +decrement_locked_vm(tbl);
>>  
>> -decrement_locked_vm(tbl);
>> +tbl = data->ops->get_table(data, 1);
>> +if (tbl)
>> +decrement_locked_vm(tbl);
>> +}
>> +
>> +if (data->ops->reset)
>> +data->ops->reset(data);
>>  }
>>  
>>  static void *tce_iommu_open(unsigned long arg)
>> @@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data,
>>   unsigned int cmd, unsigned long arg)
>>  {
>>  struct tce_container *container = iommu_data;
>> -unsigned long minsz;
>> +unsigned long minsz, ddwsz;
>>  long ret;
>>  
>>  switch (cmd) {
>> @@ -288,6 +295,28 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
>>  info.flags = 0;
>>  
>> +ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
>> +page_size_mask);
>> +
>> +if (info.argsz == ddwsz) {
> 
>> =
> 
>> +if (data->ops->query && data->ops->create &&
>> +data->ops->remove) {
>> +info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
> 
> I think you want to set this flag regardless of whether the user has
> provided space for it.  A valid use model is to call with the minimum
> size and look at the flags to determine if it needs to be called again
> with a larger size.
> 
>> +
>> +ret = data->ops->query(data,
>> +&info.current_windows,
>> +&info.windows_available,
>> +&info.page_size_mask);
>> +if (ret)
>> +return ret;
>> +} else {
>> +info.current_windows = 0;
>> +info.windows_available = 0;
>> +info.page_size_mask = 0;
>> +}
>> +minsz = ddwsz;
> 
> It's not really any longer the min size, is it?
> 
>> +}
>> +
>>  if (copy_to_user((void __user *)arg, &info, minsz))
>>  return -EFAULT;
>>  
>> @@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  tce_iommu_disable(container);
>>  mutex_unlock(&container->lock);
>>  return 0;
>> +
>>  case VFIO_EEH_PE_OP:
>>  if (!container->grp)
>>  return -ENODEV;
>>  
>>  return vfio_spapr_iommu_eeh_ioctl(container->grp,
>>cmd, arg);
>> +
>> +case VFIO_IOMMU_SPAPR_TCE_CREATE: {
>> +struct vfio_iommu_spapr_tce_create create;
>> +struct spapr_tce_iommu_group *data;
>> +struct iommu_table *tbl;
>> +
>> +if (WARN_ON(!container->grp))
> 
> redux previous comment on this warning
> 
>> + 

[PATCH v2 0/2] deadline TSC multi injection fix

2014-10-10 Thread Radim Krčmář
First patch refactors code and introduces a minor should-not-happen
optimization.  Second patch fixes the bug and improves handling of passed
deadline timers.

I still haven't looked at Amit's proposed optimization (not recomputing the
value of hrtimer on rewrites) long enough to be able to sign it off, sorry.

Radim Krčmář (2):
  KVM: x86: add apic_timer_expired()
  KVM: x86: fix deadline tsc interrupt injection

 arch/x86/kvm/lapic.c | 47 +--
 1 file changed, 25 insertions(+), 22 deletions(-)

-- 
2.1.0



[PATCH v2 2/2] KVM: x86: fix deadline tsc interrupt injection

2014-10-10 Thread Radim Krčmář
The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
situation where we lose a pending deadline timer in a MSR write.
Losing it is fine, because it effectively occurs before the timer fired,
so we should be able to cancel or postpone it.

Another problem comes from interaction with QEMU, or other userspace
that can set deadline MSR without a good reason, when timer is already
pending:  one guest's deadline request results in more than one
interrupt because one is injected immediately on MSR write from
userspace and one through hrtimer later.

The solution is to remove the injection when replacing a pending timer
and, to improve the usual QEMU path, to inject without an hrtimer when
the deadline has already passed.

Signed-off-by: Radim Krčmář 
Reported-by: Nadav Amit 
---
 arch/x86/kvm/lapic.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a3cf68b..8b30099 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1116,9 +1116,10 @@ static void start_apic_timer(struct kvm_lapic *apic)
if (likely(tscdeadline > guest_tsc)) {
ns = (tscdeadline - guest_tsc) * 1000000ULL;
do_div(ns, this_tsc_khz);
-   }
-   hrtimer_start(&apic->lapic_timer.timer,
-   ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+   hrtimer_start(&apic->lapic_timer.timer,
+   ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+   } else
+   apic_timer_expired(apic);
 
local_irq_restore(flags);
}
@@ -1375,9 +1376,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
return;
 
hrtimer_cancel(&apic->lapic_timer.timer);
-   /* Inject here so clearing tscdeadline won't override new value */
-   if (apic_has_pending_timer(vcpu))
-   kvm_inject_apic_timer_irqs(vcpu);
apic->lapic_timer.tscdeadline = data;
start_apic_timer(apic);
 }
-- 
2.1.0



[PATCH v2 1/2] KVM: x86: add apic_timer_expired()

2014-10-10 Thread Radim Krčmář
Make the code reusable.

If the timer was already pending, we shouldn't be waiting in a queue,
so wake_up can be skipped, simplifying the path.

There is no 'reinject' case => the comment is removed.
Current race behaves correctly.

Signed-off-by: Radim Krčmář 
---
 Not sure about the FIXME, and it might motivate someone to redo the
 implicit checking as well :)

 arch/x86/kvm/lapic.c | 37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b8345dd..a3cf68b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1034,6 +1034,26 @@ static void update_divide_count(struct kvm_lapic *apic)
   apic->divide_count);
 }
 
+static void apic_timer_expired(struct kvm_lapic *apic)
+{
+   struct kvm_vcpu *vcpu = apic->vcpu;
+   wait_queue_head_t *q = &vcpu->wq;
+
+   /*
+* Note: KVM_REQ_PENDING_TIMER is implicitly checked in
+* vcpu_enter_guest.
+*/
+   if (atomic_read(&apic->lapic_timer.pending))
+   return;
+
+   atomic_inc(&apic->lapic_timer.pending);
+   /* FIXME: this code should not know anything about vcpus */
+   kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
+
+   if (waitqueue_active(q))
+   wake_up_interruptible(q);
+}
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
ktime_t now;
@@ -1538,23 +1558,8 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
 {
struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer);
-   struct kvm_vcpu *vcpu = apic->vcpu;
-   wait_queue_head_t *q = &vcpu->wq;
 
-   /*
-* There is a race window between reading and incrementing, but we do
-* not care about potentially losing timer events in the !reinject
-* case anyway. Note: KVM_REQ_PENDING_TIMER is implicitly checked
-* in vcpu_enter_guest.
-*/
-   if (!atomic_read(&ktimer->pending)) {
-   atomic_inc(&ktimer->pending);
-   /* FIXME: this code should not know anything about vcpus */
-   kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
-   }
-
-   if (waitqueue_active(q))
-   wake_up_interruptible(q);
+   apic_timer_expired(apic);
 
if (lapic_is_periodic(apic)) {
hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
-- 
2.1.0



[Bug 81841] amd-iommu: kernel BUG & lockup after shutting down KVM guest using PCI passthrough/PCIe bridge

2014-10-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=81841

Marti Raudsepp  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |CODE_FIX

--- Comment #23 from Marti Raudsepp  ---
Closing bug, ACS patch merged to mainline Linux:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3587e625fe24a2d1cd1891fc660c3313151a368c

Thanks Joerg.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access

2014-10-10 Thread Radim Krčmář
2014-10-10 05:07+0300, Nadav Amit:
> When read access is performed using a readable code segment, the "conforming"
> and "non-conforming" checks should not be done.  As a result, read using
> non-conforming readable code segment fails.
> 
> This is according to Intel SDM 5.6.1 ("Accessing Data in Code Segments").
> 
> One exception is the case of conforming code segment. The SDM says: "Use a
> code-segment override prefix (CS) to read a readable...  [it is] valid because
> the DPL of the code segment selected by the CS register is the same as the
> CPL." This is misleading since CS.DPL may be lower (numerically) than CPL, and
> CS would still be accessible.  The emulator should avoid privilege level
> checks for data reads using CS.

Ah, after stripping faulty presumptions, I'm not sure this change is
enough ... shouldn't we also skip the check on conforming code segments?

 Method 2 is always valid because the privilege level of a conforming
 code segment is effectively the same as the CPL, regardless of its DPL.

And we couldn't read it from less privileged modes.

> The fix is not to perform the "non-conforming" checks if the access is not a
> fetch, and never to perform the checks for CS.
> 
> ---
> v1->v2: Privilege level checks are always skipped for CS
> 
> Signed-off-by: Nadav Amit 
> ---
>  arch/x86/kvm/emulate.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index a46207a..0fee0a0 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -661,9 +661,9 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
>   goto bad;
>   }
>   cpl = ctxt->ops->cpl(ctxt);
> - if (!(desc.type & 8)) {
> - /* data segment */
> - if (cpl > desc.dpl)
> + if (!fetch) {
> + /* data segment or readable code segment */
> + if (cpl > desc.dpl && addr.seg != VCPU_SREG_CS)
>   goto bad;
>   } else if ((desc.type & 8) && !(desc.type & 4)) {
>   /* nonconforming code segment */
> -- 
> 1.9.1
> 


Re: [PATCH v3 3/3] arm64: Allow 48-bits VA space without ARM_SMMU

2014-10-10 Thread Catalin Marinas
On Fri, Oct 10, 2014 at 11:14:30AM +0100, Christoffer Dall wrote:
> Now when KVM has been reworked to support 48-bits host VA space, we can
> allow systems to be configured with this option.  However, the ARM SMMU
> driver also needs to be tweaked for 48-bit support so only allow the
> config option to be set when not including support for the SMMU.
> 
> Signed-off-by: Christoffer Dall 
> ---
>  arch/arm64/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index fd4e81a..a76c6c3b 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -200,7 +200,7 @@ config ARM64_VA_BITS_42
>  
>  config ARM64_VA_BITS_48
>   bool "48-bit"
> - depends on BROKEN
> + depends on !ARM_SMMU
>  
>  endchoice

I think we should rather merge this separately via the arm64 tree once
we test 48-bit VA some more (and as you noticed, there is a bug
already).

-- 
Catalin


Re: [PATCH v3 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2

2014-10-10 Thread Catalin Marinas
On Fri, Oct 10, 2014 at 11:14:28AM +0100, Christoffer Dall wrote:
> This patch adds the necessary support for all host kernel PGSIZE and
> VA_SPACE configuration options for both EL2 and the Stage-2 page tables.
> 
> However, for 40bit and 42bit PARange systems, the architecture mandates
> that VTCR_EL2.SL0 is maximum 1, resulting in fewer levels of stage-2
> page tables than levels of host kernel page tables.  At the same time,
> systems with a PARange > 42bit, we limit the IPA range by always setting
> VTCR_EL2.T0SZ to 24.
> 
> To solve the situation with different levels of page tables for Stage-2
> translation than the host kernel page tables, we allocate a dummy PGD
> with pointers to our actual initial level Stage-2 page table, in order
> for us to reuse the kernel pgtable manipulation primitives.  Reproducing
> all these in KVM does not look pretty and unnecessarily complicates the
> 32-bit side.
> 
> Systems with a PARange < 40bits are not yet supported.
> 
>  [ I have reworked this patch from its original form submitted by
>Jungseok to take the architecture constraints into consideration.
>There were too many changes from the original patch for me to
>preserve the authorship.  Thanks to Catalin Marinas for his help in
>figuring out a good solution to this challenge.  I have also fixed
>various bugs and missing error code handling from the original
>patch. - Christoffer ]
> 
> Cc: Marc Zyngier 
> Cc: Catalin Marinas 
> Signed-off-by: Jungseok Lee 
> Signed-off-by: Christoffer Dall 

Reviewed-by: Catalin Marinas 


Re: [PATCH v3 2/3] arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE

2014-10-10 Thread Catalin Marinas
On Fri, Oct 10, 2014 at 11:14:29AM +0100, Christoffer Dall wrote:
> When creating or moving a memslot, make sure the IPA space is within the
> addressable range of the guest.  Otherwise, user space can create too
> large a memslot and KVM would try to access potentially unallocated page
> table entries when inserting entries in the Stage-2 page tables.
> 
> Signed-off-by: Christoffer Dall 

Acked-by: Catalin Marinas 


Re: [PATCH kvm-kmod 0/3] First round of kvm-kmod fixes for 3.18 merge window

2014-10-10 Thread Jan Kiszka
On 2014-10-09 13:53, Paolo Bonzini wrote:
> Patches are relative to next branch of kvm-kmod.git.
> 
> Paolo
> 
> Paolo Bonzini (3):
>   FOLL_TRIED is not available before 3.18
>   the MMU notifier clear_flush_young callback changed in 3.18
>   redefine is_zero_pfn to not rely on zero_pfn
> 
>  external-module-compat-comm.h |  5 -
>  sync  | 16 
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 

Thanks, applied.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes

2014-10-10 Thread Paolo Bonzini
Il 08/10/2014 12:06, Radim Krčmář ha scritto:
> KVM: x86: fix deadline tsc interrupt injection
> 
> The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
> situation where we lose a pending deadline timer in a MSR write.
> Losing it is fine, because it effectively occurs before the timer fired,
> so we should be able to cancel or postpone it.
> 
> Another problem comes from interaction with QEMU, or other userspace
> that can set deadline MSR without a good reason, when timer is already
> pending:  one guest's deadline request results in more than one
> interrupt because one is injected immediately on MSR write from
> userspace and one through hrtimer later.
> 
> The solution is to remove the injection when replacing a pending timer
> and to improve the usual QEMU path, we inject without a hrtimer when the
> deadline has already passed.
> 
> Signed-off-by: Radim Krčmář 
> Reported-by: Nadav Amit 
> ---
>  arch/x86/kvm/lapic.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8345dd..51428dd 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>   if (likely(tscdeadline > guest_tsc)) {
>   ns = (tscdeadline - guest_tsc) * 1000000ULL;
>   do_div(ns, this_tsc_khz);
> + hrtimer_start(&apic->lapic_timer.timer,
> + ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> + } else {
> + atomic_inc(&ktimer->pending);
> + kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>   }
> - hrtimer_start(&apic->lapic_timer.timer,
> - ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>  
>   local_irq_restore(flags);
>   }
> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
>   return;
>  
>   hrtimer_cancel(&apic->lapic_timer.timer);
> - /* Inject here so clearing tscdeadline won't override new value */
> - if (apic_has_pending_timer(vcpu))
> - kvm_inject_apic_timer_irqs(vcpu);
>   apic->lapic_timer.tscdeadline = data;
>   start_apic_timer(apic);
>  }

Radim, the patch looks good but please extract this code:

/*
 * There is a race window between reading and incrementing, but we do
 * not care about potentially losing timer events in the !reinject
 * case anyway. Note: KVM_REQ_PENDING_TIMER is implicitly checked
 * in vcpu_enter_guest.
 */
if (!atomic_read(&ktimer->pending)) {
atomic_inc(&ktimer->pending);
/* FIXME: this code should not know anything about vcpus */
kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
}

if (waitqueue_active(q))
wake_up_interruptible(q);

to a new "static void apic_timer_expired(struct kvm_lapic *apic)" function,
and call it from both apic_timer_fn and start_apic_timer.

Also, we should not need to do wake_up_interruptible() unless we have
changed ktimer->pending from zero to non-zero.
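A quick user-space model of that combined suggestion, with the pending counter as a C11 atomic (the function and variable names here are illustrative, not the kernel's):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Models ktimer->pending. */
static atomic_int pending;

/*
 * Sketch of the suggested apic_timer_expired() helper: bump the
 * pending count only when it was zero, and report whether the caller
 * should also wake the vcpu, i.e. only on the 0 -> 1 transition.
 */
static bool apic_timer_expired_model(void)
{
    if (atomic_load(&pending) != 0)
        return false;      /* already pending: no request, no wakeup */
    atomic_fetch_add(&pending, 1);
    return true;           /* caller would set KVM_REQ_PENDING_TIMER
                              and do wake_up_interruptible() here */
}
```

Both apic_timer_fn and start_apic_timer would call this helper and issue the wakeup only when it returns true.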

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes

2014-10-10 Thread Paolo Bonzini
On 10/10/2014 14:51, Nadav Amit wrote:
>>> Second, I think that the solution I proposed would perform better.
>>> Currently, there are many unnecessary cancellations and setups of the
>>> timer. This solution does not resolve this problem.
>>
>> I think it does.  You do not get an hrtimer_start if tscdeadline <=
>> guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueued
>> before calling hrtimer_cancel, or go straight to the source and avoid
>> taking the lock in the easy cases:
>>
>> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
>> index 1c2fe7de2842..6ce725007424 100644
>> --- a/kernel/time/hrtimer.c
>> +++ b/kernel/time/hrtimer.c
>> @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
>> {
>>  struct hrtimer_clock_base *base;
>>  unsigned long flags;
>> -int ret = -1;
>> +unsigned long state = timer->state;
>> +int ret;
>> +
>> +if (state & HRTIMER_STATE_ENQUEUED)
>> +return 0;
>> +if (state & HRTIMER_STATE_CALLBACK)
>> +return -1;
>>
>>  base = lock_hrtimer_base(timer, &flags);
>>
>> +ret = -1;
>>  if (!hrtimer_callback_running(timer))
>>  ret = remove_hrtimer(timer, base);
> Wouldn’t this change cause cancellations never to succeed (the first 
> check would always be true if the timer is active)?

Ehm, there is a missing ! in that first "if".
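For clarity, here is a small user-space model of the intended lockless fast path with the ! in place (state bit names are from the quoted diff; checking CALLBACK before ENQUEUED is one way to keep the three outcomes distinct, and the check is still racy by design since the state can change before the lock is taken):

```c
#define HRTIMER_STATE_ENQUEUED 0x01UL
#define HRTIMER_STATE_CALLBACK 0x02UL

/*
 * Returns 0 if the timer is inactive (nothing to cancel), -1 if its
 * callback is running (cannot cancel), and 1 if the caller must take
 * the base lock and attempt remove_hrtimer().
 */
static int try_to_cancel_fast_path(unsigned long state)
{
    if (state & HRTIMER_STATE_CALLBACK)
        return -1;                 /* callback running: cannot cancel */
    if (!(state & HRTIMER_STATE_ENQUEUED))
        return 0;                  /* inactive: nothing to cancel */
    return 1;                      /* enqueued: take the locked path */
}
```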

>>> Last, I think that having less interrupts on deadline changes is not
>>> completely according to the SDM which says: "If software disarms the
>>> timer or postpones the deadline, race conditions may result in the
>>> delivery of a spurious timer interrupt.” It never says interrupts may
>>> be lost if you reprogram the deadline before you check it expired.
>>
>> But the case when you rewrite the same value to the MSR is neither
>> disarming nor postponing.  You would be getting two interrupts for the
>> same event.  That is why I agree with Radim that checking host_initiated
>> is wrong.
> 
> I understand, and Radim's solution seems functionally fine, now that I am 
> fully awake and understand it.
> I still think that if tscdeadline > guest_tsc, then reprogramming the 
> deadline with the same value, as QEMU does, would result in unwarranted 
> overhead.

The overhead is about two atomic operations (70 clock cycles?).  Still,
there are plenty of other micro-optimizations possible:

1) instead of incrementing timer->pending, set it to 1

2) change it to test_and_set_bit and only set PENDING_TIMER if the
result was zero

3) non-atomically test PENDING_TIMER before (atomically) clearing it

4) return bool from kvm_inject_apic_timer_irqs and only clear
PENDING_TIMER if a timer interrupt was actually injected.

(1) or (2) would remove one atomic operation when reprogramming a passed
deadline with the same value.  (3) or (4) would remove one atomic
operation in the case where the cause of the exit is not an expired
timer.  Any takers?
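Options (2) and (3) can be modeled in user space with C11 atomics (a sketch under the assumption that pending only ever needs to be 0 or 1; names are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int timer_pending;   /* models ktimer->pending as a 0/1 flag */

/* Option (2): one atomic op on expiry; request an exit only when we
 * were the ones flipping 0 -> 1. */
static bool timer_fire(void)
{
    return atomic_exchange(&timer_pending, 1) == 0;
}

/* Option (3): cheap non-atomic read first, so vm-exits not caused by
 * the timer skip the atomic clear entirely. */
static bool timer_consume(void)
{
    if (atomic_load_explicit(&timer_pending, memory_order_relaxed) == 0)
        return false;
    return atomic_exchange(&timer_pending, 0) != 0;
}
```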

> Perhaps it would be enough not to reprogram the timer if tscdeadline value 
> does not change (by either guest or host).

Yes, and that would just be your patch without host_initiated.

Paolo


Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-10-10 Thread Gleb Natapov
On Wed, Oct 08, 2014 at 04:22:31PM -0300, Marcelo Tosatti wrote:
> > > 
> > > Argh, lets try again:
> > > 
> > > skip_pinned = true
> > > --
> > > 
> > > mark page dirty, keep spte intact
> > > 
> > > called from get dirty log path.
> > > 
> > > skip_pinned = false
> > > ---
> > > reload remote mmu
> > > destroy pinned spte.
> > > 
> > > called from: dirty log enablement, rmap write protect (unused for pinned
> > > sptes)
> > > 
> > > 
> > > Note this behaviour is your suggestion:
> > 
> > Yes, I remember that and I thought we will not need this skip_pinned
> > at all.  For rmap write protect case there shouldn't be any pinned pages,
> > but why dirty log enablement sets skip_pinned to false? Why not mark
> > pinned pages as dirty just like you do in get dirty log path?
> 
> Because if it's a large spte, it must be nuked (or marked read-only,
> which, for pinned sptes, is not possible).
> 
If a large page has one small page pinned inside it, its spte will be
marked as pinned, correct?  We did nuke large ptes here until very
recently (c126d94f2c90ed9d), but we cannot drop a pte here anyway without
kicking all vcpus out of guest mode.  So do you really need the additional
skip_pinned parameter?  Why not check whether the spte is large instead?

So why not have a per-slot pinned page list (Xiao suggested the same) and do:

spte_write_protect() {
  if (is_pinned(spte)) {
    if (large(spte))
      // cannot drop while vcpus are running
      mmu_reload_pinned_vcpus();
    else
      return false;
  }
}


get_dirty_log() {
  for_each(pinned pages i)
    mark_dirty(i);
}

--
Gleb.


Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes

2014-10-10 Thread Nadav Amit

On Oct 10, 2014, at 12:45 PM, Paolo Bonzini  wrote:

> On 10/10/2014 03:55, Nadav Amit wrote:
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index b8345dd..51428dd 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>>> if (likely(tscdeadline > guest_tsc)) {
>>> ns = (tscdeadline - guest_tsc) * 1000000ULL;
>>> do_div(ns, this_tsc_khz);
>>> +   hrtimer_start(&apic->lapic_timer.timer,
>>> +   ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>>> +   } else {
>>> +   atomic_inc(&ktimer->pending);
>>> +   kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>>> }
>>> -   hrtimer_start(&apic->lapic_timer.timer,
>>> -   ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>>> 
>>> local_irq_restore(flags);
>>> }
>>> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu 
>>> *vcpu, u64 data)
>>> return;
>>> 
>>> hrtimer_cancel(&apic->lapic_timer.timer);
>>> -   /* Inject here so clearing tscdeadline won't override new value */
>>> -   if (apic_has_pending_timer(vcpu))
>>> -   kvm_inject_apic_timer_irqs(vcpu);
>>> apic->lapic_timer.tscdeadline = data;
>>> start_apic_timer(apic);
>>> }
>> 
>> Perhaps I am missing something, but I don’t see how it solves the problem I 
>> encountered.
>> Recall the scenario:
>> 1. A TSC deadline timer interrupt is pending.
>> 2. TSC deadline was still not cleared (which happens during vcpu_run).
>> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
>> 
>> It appears that in such scenario the guest would still get spurious
>> interrupt for no reason, as ktimer->pending may already be increased in
>> apic_timer_fn.
> 
> If ktimer->pending == 2 you still get only one interrupt, not two.  Am I
> misunderstanding your objection?
You understood me, and I was wrong...

> 
>> Second, I think that the solution I proposed would perform better.
>> Currently, there are many unnecessary cancellations and setups of the
>> timer. This solution does not resolve this problem.
> 
> I think it does.  You do not get an hrtimer_start if tscdeadline <=
> guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueued
> before calling hrtimer_cancel, or go straight to the source and avoid
> taking the lock in the easy cases:
> 
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 1c2fe7de2842..6ce725007424 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
> {
>   struct hrtimer_clock_base *base;
>   unsigned long flags;
> - int ret = -1;
> + unsigned long state = timer->state;
> + int ret;
> +
> + if (state & HRTIMER_STATE_ENQUEUED)
> + return 0;
> + if (state & HRTIMER_STATE_CALLBACK)
> + return -1;
> 
>   base = lock_hrtimer_base(timer, &flags);
> 
> + ret = -1;
>   if (!hrtimer_callback_running(timer))
>   ret = remove_hrtimer(timer, base);
Wouldn’t this change cause cancellations never to succeed (the first 
check would always be true if the timer is active)?

> 
> 
>> Last, I think that having less interrupts on deadline changes is not
>> completely according to the SDM which says: "If software disarms the
>> timer or postpones the deadline, race conditions may result in the
>> delivery of a spurious timer interrupt.” It never says interrupts may
>> be lost if you reprogram the deadline before you check it expired.
> 
> But the case when you rewrite the same value to the MSR is neither
> disarming nor postponing.  You would be getting two interrupts for the
> same event.  That is why I agree with Radim that checking host_initiated
> is wrong.
I understand, and Radim's solution seems functionally fine, now that I am fully 
awake and understand it.
I still think that if tscdeadline > guest_tsc, then reprogramming the deadline 
with the same value, as QEMU does, would result in unwarranted overhead.
Perhaps it would be enough not to reprogram the timer if tscdeadline value does 
not change (by either guest or host).

Thanks,
Nadav



Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes

2014-10-10 Thread Radim Krčmář
2014-10-10 11:45+0200, Paolo Bonzini:
> On 10/10/2014 03:55, Nadav Amit wrote:
> >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> index b8345dd..51428dd 100644
> >> --- a/arch/x86/kvm/lapic.c
> >> +++ b/arch/x86/kvm/lapic.c
> >> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
> >>if (likely(tscdeadline > guest_tsc)) {
> >>ns = (tscdeadline - guest_tsc) * 1000000ULL;
> >>do_div(ns, this_tsc_khz);
> >> +  hrtimer_start(&apic->lapic_timer.timer,
> >> +  ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> >> +  } else {
> >> +  atomic_inc(&ktimer->pending);
> >> +  kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
> >>}
> >> -  hrtimer_start(&apic->lapic_timer.timer,
> >> -  ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> >>
> >>local_irq_restore(flags);
> >>}
> >> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu 
> >> *vcpu, u64 data)
> >>return;
> >>
> >>hrtimer_cancel(&apic->lapic_timer.timer);
> >> -  /* Inject here so clearing tscdeadline won't override new value */
> >> -  if (apic_has_pending_timer(vcpu))
> >> -  kvm_inject_apic_timer_irqs(vcpu);
> >>apic->lapic_timer.tscdeadline = data;
> >>start_apic_timer(apic);
> >> }
> > 
> > Perhaps I am missing something, but I don’t see how it solves the problem I 
> > encountered.
> > Recall the scenario:
> > 1. A TSC deadline timer interrupt is pending.
> > 2. TSC deadline was still not cleared (which happens during vcpu_run).
> > 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
> > 
> > It appears that in such scenario the guest would still get spurious
> > interrupt for no reason, as ktimer->pending may already be increased in
> > apic_timer_fn.
> 
> If ktimer->pending == 2 you still get only one interrupt, not two.  Am I
> misunderstanding your objection?

(I don't see it either, the patch removed kvm_inject_apic_timer_irqs(),
 so we inject as if the MSR write didn't happen.)

> > Second, I think that the solution I proposed would perform better.
> > Currently, there are many unnecessary cancellations and setups of the
> > timer. This solution does not resolve this problem.
> 
> I think it does.  You do not get an hrtimer_start if tscdeadline <=
> guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueued
> before calling hrtimer_cancel, or go straight to the source and avoid
> taking the lock in the easy cases:

I think the useless cancels were when the timer is set to the same value
in the future.

Which makes sense to optimize, but I wasn't sure how it would affect the
guest if the TSC offset was changing or the TSC itself wasn't stable ...
so I chose not to do it.

> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 1c2fe7de2842..6ce725007424 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
>  {
>   struct hrtimer_clock_base *base;
>   unsigned long flags;
> - int ret = -1;
> + unsigned long state = timer->state;
> + int ret;
> +
> + if (state & HRTIMER_STATE_ENQUEUED)
> + return 0;

It doesn't try very hard to cancel it ;)

> + if (state & HRTIMER_STATE_CALLBACK)
> + return -1;
> 
>   base = lock_hrtimer_base(timer, &flags);
> 
> + ret = -1;
>   if (!hrtimer_callback_running(timer))
>   ret = remove_hrtimer(timer, base);
> 
> > Last, I think that having less interrupts on deadline changes is not
> > completely according to the SDM which says: "If software disarms the
> > timer or postpones the deadline, race conditions may result in the
> > delivery of a spurious timer interrupt.” It never says interrupts may
> > be lost if you reprogram the deadline before you check it expired.

Yeah, too bad it wasn't written in a formally defined language ...
They could be just reserving design space for concurrent implementation,
not making race conditions mandatory.

Our race windows are much bigger -- a disarm that is hundreds of cycles
in the future can easily be hit after an MSR write vm-exit, but it would
be very unlikely to get an interrupt on bare metal.

And thinking about the meaning of postpone or disarm -- software doesn't
want the old interrupt anymore, so it would just add useless work.

(And I liked the code better without it, but it'd be ok if we fixed all
 the cases.)

> But the case when you rewrite the same value to the MSR is neither
> disarming nor postponing.  You would be getting two interrupts for the
> same event.  That is why I agree with Radim that checking host_initiated
> is wrong.

Current implementation also injects two interrupts when the timer is set
to a lower nonzero value, which should give us just one under both
interpretations.

Re: [question] Is there a plan to introduce a unified co-scheduling mechanism to CFS ?

2014-10-10 Thread Zhang Haoyu

>> Hi,
>>
>> Is it worthwhile to introduce a unified co-scheduling mechanism to CFS?
>> Because multiple cooperating threads or tasks frequently synchronize with
>> each other, not executing them concurrently would only increase the
>> latency of synchronization.
>> For example, a thread spinning on a spinlock while waiting for another
>> thread to release that spinlock might reduce its waiting time by being
>> executed concurrently with the thread which holds the spinlock.
>> In a virtualization scenario, co-scheduling multiple vcpus which belong
>> to the same vm is even more desirable when several cooperating
>> threads/tasks are running in the guest.
>>
>> Is there a plan for this work?
>
> Please refer to gang scheduler.
>
Is there a mechanism to dynamically detect which vcpus belong to the same
gang?  Maybe a cooperation degree could be used as the threshold for
deciding which vcpus belong to the same gang, just a wild thought.


> Regards,
> Wanpeng Li
>>
>> Thanks,
>> Zhang Haoyu



Re: [question] Is there a plan to introduce a unified co-scheduling mechanism to CFS ?

2014-10-10 Thread Wanpeng Li


On 10/10/14, 7:37 PM, Zhang Haoyu wrote:

Hi,

Is it worthwhile to introduce a unified co-scheduling mechanism to CFS?
Because multiple cooperating threads or tasks frequently synchronize with
each other, not executing them concurrently would only increase the
latency of synchronization.
For example, a thread spinning on a spinlock while waiting for another
thread to release that spinlock might reduce its waiting time by being
executed concurrently with the thread which holds the spinlock.
In a virtualization scenario, co-scheduling multiple vcpus which belong
to the same vm is even more desirable when several cooperating
threads/tasks are running in the guest.

Is there a plan for this work?


Please refer to gang scheduler.

Regards,
Wanpeng Li



Thanks,
Zhang Haoyu




[question] Is there a plan to introduce a unified co-scheduling mechanism to CFS ?

2014-10-10 Thread Zhang Haoyu

Hi,

Is it worthwhile to introduce a unified co-scheduling mechanism to CFS?
Because multiple cooperating threads or tasks frequently synchronize with
each other, not executing them concurrently would only increase the
latency of synchronization.
For example, a thread spinning on a spinlock while waiting for another
thread to release that spinlock might reduce its waiting time by being
executed concurrently with the thread which holds the spinlock.
In a virtualization scenario, co-scheduling multiple vcpus which belong
to the same vm is even more desirable when several cooperating
threads/tasks are running in the guest.

Is there a plan for this work?

Thanks,
Zhang Haoyu


[PATCH v3 0/3] arm/arm64: KVM: Host 48-bit VA support and IPA limits

2014-10-10 Thread Christoffer Dall
The following patches fix up some missing memory handling in
KVM/arm64.

The first patch supports 48 bit virtual address space which involves
supporting a different number of levels of page tables in the host
kernel and the stage-2 page tables.

The second patch ensures userspace cannot create memory slots with too
large IPA space given VTCR_EL2.T0SZ = 24.

Finally, we enable 48-bit VA support in Linux.

The following host configurations have been tested with KVM on APM
Mustang:

 1)  4KB  + 39 bits VA space
 2)  4KB  + 48 bits VA space
 3) 64KB  + 39 bits VA space
 4) 64KB  + 48 bits VA space

Tested on TC2 for regressions.

Christoffer Dall (3):
  arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
  arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE
  arm64: Allow 48-bits VA space without ARM_SMMU

 arch/arm/include/asm/kvm_mmu.h   |  29 +++-
 arch/arm/kvm/arm.c   |   2 +-
 arch/arm/kvm/mmu.c   | 142 +++
 arch/arm64/Kconfig   |   2 +-
 arch/arm64/include/asm/kvm_mmu.h | 125 +++---
 5 files changed, 258 insertions(+), 42 deletions(-)

-- 
2.1.2.330.g565301e.dirty



[PATCH v3 3/3] arm64: Allow 48-bits VA space without ARM_SMMU

2014-10-10 Thread Christoffer Dall
Now that KVM has been reworked to support a 48-bit host VA space, we can
allow systems to be configured with this option.  However, the ARM SMMU
driver also needs to be tweaked for 48-bit support, so only allow the
config option to be set when not including support for the SMMU.

Signed-off-by: Christoffer Dall 
---
 arch/arm64/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fd4e81a..a76c6c3b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -200,7 +200,7 @@ config ARM64_VA_BITS_42
 
 config ARM64_VA_BITS_48
bool "48-bit"
-   depends on BROKEN
+   depends on !ARM_SMMU
 
 endchoice
 
-- 
2.1.2.330.g565301e.dirty



[PATCH v3 2/3] arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE

2014-10-10 Thread Christoffer Dall
When creating or moving a memslot, make sure the IPA space is within the
addressable range of the guest.  Otherwise, user space can create too
large a memslot and KVM would try to access potentially unallocated page
table entries when inserting entries in the Stage-2 page tables.

Signed-off-by: Christoffer Dall 
---
 arch/arm/kvm/mmu.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 6e87233..d664bff 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -994,6 +994,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
goto out_unlock;
}
 
+   /* Userspace should not be able to register out-of-bounds IPAs */
+   VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE);
+
ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
if (ret == 0)
ret = 1;
@@ -1218,6 +1221,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
   struct kvm_userspace_memory_region *mem,
   enum kvm_mr_change change)
 {
+   if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
+   if (memslot->base_gfn + memslot->npages >=
+   (KVM_PHYS_SIZE >> PAGE_SHIFT))
+   return -EFAULT;
+   }
return 0;
 }
 
-- 
2.1.2.330.g565301e.dirty
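The arithmetic in that check can be sketched in user space (the PAGE_SHIFT and IPA-size values below are illustrative, matching VTCR_EL2.T0SZ = 24, i.e. a 40-bit IPA; the helper name is not the kernel's):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT     12
#define KVM_PHYS_SHIFT 40
#define KVM_PHYS_SIZE  (1ULL << KVM_PHYS_SHIFT)

/* Mirrors the kvm_arch_prepare_memory_region() check: a slot is
 * acceptable only when every gfn it covers is below the guest's
 * IPA limit. */
static bool memslot_in_bounds(uint64_t base_gfn, uint64_t npages)
{
    return base_gfn + npages < (KVM_PHYS_SIZE >> PAGE_SHIFT);
}
```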



[PATCH v3 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2

2014-10-10 Thread Christoffer Dall
This patch adds the necessary support for all host kernel PGSIZE and
VA_SPACE configuration options for both EL2 and the Stage-2 page tables.

However, for 40-bit and 42-bit PARange systems, the architecture mandates
that VTCR_EL2.SL0 is at most 1, resulting in fewer levels of stage-2
page tables than levels of host kernel page tables.  At the same time, on
systems with a PARange > 42 bits, we limit the IPA range by always setting
VTCR_EL2.T0SZ to 24.

To solve the situation with different levels of page tables for Stage-2
translation than the host kernel page tables, we allocate a dummy PGD
with pointers to our actual initial-level Stage-2 page table, in order
for us to reuse the kernel pgtable manipulation primitives.  Reproducing
all these in KVM does not look pretty and unnecessarily complicates the
32-bit side.

Systems with a PARange < 40 bits are not yet supported.

 [ I have reworked this patch from its original form submitted by
   Jungseok to take the architecture constraints into consideration.
   There were too many changes from the original patch for me to
   preserve the authorship.  Thanks to Catalin Marinas for his help in
   figuring out a good solution to this challenge.  I have also fixed
   various bugs and missing error code handling from the original
   patch. - Christoffer ]

Cc: Marc Zyngier 
Cc: Catalin Marinas 
Signed-off-by: Jungseok Lee 
Signed-off-by: Christoffer Dall 
---
Changes [v2 -> v3]:
 - Calculate PTRS_PER_S2_PGD and KVM_PREALLOC_LEVEL based on PGDIR_SHIFT
   instead of complicated conditionals.
 - Add explanatory comments to the above definitions
 - Single definition of kvm_prealloc_hwpgd() and friends
 - Integrated tests for KVM_PREALLOC_LEVEL into kvm_pXd_table_empty()

Changes [v1 -> v2]:
 - Use KVM_PREALLOC_LEVELS directly instead of C-variable indirection
 - Factored out the config changes to separate patch
 - Use __GFP_ZERO instead of memset
 - Fixed error return path in kvm_alloc_stage2_pgd()
 - Added WARN_ON if pgd_none() returns true
 - Changed some macro definitions and names

 arch/arm/include/asm/kvm_mmu.h   |  29 -
 arch/arm/kvm/arm.c   |   2 +-
 arch/arm/kvm/mmu.c   | 134 +++
 arch/arm64/include/asm/kvm_mmu.h | 125 +---
 4 files changed, 249 insertions(+), 41 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 3f688b4..3b0fdd0 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -37,6 +37,11 @@
  */
 #define TRAMPOLINE_VA  UL(CONFIG_VECTORS_BASE)
 
+/*
+ * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
+ */
+#define KVM_MMU_CACHE_MIN_PAGES 2
+
 #ifndef __ASSEMBLY__
 
 #include 
@@ -83,6 +88,11 @@ static inline void kvm_clean_pgd(pgd_t *pgd)
clean_dcache_area(pgd, PTRS_PER_S2_PGD * sizeof(pgd_t));
 }
 
+static inline void kvm_clean_pmd(pmd_t *pmd)
+{
+   clean_dcache_area(pmd, PTRS_PER_PMD * sizeof(pmd_t));
+}
+
 static inline void kvm_clean_pmd_entry(pmd_t *pmd)
 {
clean_pmd_entry(pmd);
@@ -123,10 +133,23 @@ static inline bool kvm_page_empty(void *ptr)
 }
 
 
-#define kvm_pte_table_empty(ptep) kvm_page_empty(ptep)
-#define kvm_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
-#define kvm_pud_table_empty(pudp) (0)
+#define kvm_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
+#define kvm_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
+#define kvm_pud_table_empty(kvm, pudp) (0)
+
+#define KVM_PREALLOC_LEVEL 0
 
+static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
+{
+   return 0;
+}
+
+static inline void kvm_free_hwpgd(struct kvm *kvm) { }
+
+static inline void *kvm_get_hwpgd(struct kvm *kvm)
+{
+   return kvm->arch.pgd;
+}
 
 struct kvm;
 
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 45e5f67..9e193c8 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -409,7 +409,7 @@ static void update_vttbr(struct kvm *kvm)
kvm_next_vmid++;
 
/* update vttbr to be used with the new vmid */
-   pgd_phys = virt_to_phys(kvm->arch.pgd);
+   pgd_phys = virt_to_phys(kvm_get_hwpgd(kvm));
BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK);
vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK;
kvm->arch.vttbr = pgd_phys | vmid;
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index eea0306..6e87233 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -42,7 +42,7 @@ static unsigned long hyp_idmap_start;
 static unsigned long hyp_idmap_end;
 static phys_addr_t hyp_idmap_vector;
 
-#define pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
+#define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
 
 #define kvm_pmd_huge(_x)   (pmd_huge(_x) || pmd_trans_huge(_x))
 
@@ -134,7 +134,7 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
}
} while (pte++, addr += PAGE_SIZE, addr != end);
 
-   if (kvm_pte_table_empty(

Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes

2014-10-10 Thread Paolo Bonzini
On 10/10/2014 03:55, Nadav Amit wrote:
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index b8345dd..51428dd 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>>  if (likely(tscdeadline > guest_tsc)) {
>>  ns = (tscdeadline - guest_tsc) * 1000000ULL;
>>  do_div(ns, this_tsc_khz);
>> +hrtimer_start(&apic->lapic_timer.timer,
>> +ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>> +} else {
>> +atomic_inc(&ktimer->pending);
>> +kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>>  }
>> -hrtimer_start(&apic->lapic_timer.timer,
>> -ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>>
>>  local_irq_restore(flags);
>>  }
>> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu 
>> *vcpu, u64 data)
>>  return;
>>
>>  hrtimer_cancel(&apic->lapic_timer.timer);
>> -/* Inject here so clearing tscdeadline won't override new value */
>> -if (apic_has_pending_timer(vcpu))
>> -kvm_inject_apic_timer_irqs(vcpu);
>>  apic->lapic_timer.tscdeadline = data;
>>  start_apic_timer(apic);
>> }
> 
> Perhaps I am missing something, but I don’t see how it solves the problem I 
> encountered.
> Recall the scenario:
> 1. A TSC deadline timer interrupt is pending.
> 2. TSC deadline was still not cleared (which happens during vcpu_run).
> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
> 
> It appears that in such scenario the guest would still get spurious
> interrupt for no reason, as ktimer->pending may already be increased in
> apic_timer_fn.

If ktimer->pending == 2 you still get only one interrupt, not two.  Am I
misunderstanding your objection?

> Second, I think that the solution I proposed would perform better.
> Currently, there are many unnecessary cancellations and setups of the
> timer. This solution does not resolve this problem.

I think it does.  You do not get an hrtimer_start if tscdeadline <=
guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueued
before calling hrtimer_cancel, or go straight to the source and avoid
taking the lock in the easy cases:

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 1c2fe7de2842..6ce725007424 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
 {
struct hrtimer_clock_base *base;
unsigned long flags;
-   int ret = -1;
+   unsigned long state = timer->state;
+   int ret;
+
+   if (state & HRTIMER_STATE_ENQUEUED)
+   return 0;
+   if (state & HRTIMER_STATE_CALLBACK)
+   return -1;

base = lock_hrtimer_base(timer, &flags);

+   ret = -1;
if (!hrtimer_callback_running(timer))
ret = remove_hrtimer(timer, base);


> Last, I think that having less interrupts on deadline changes is not
> completely according to the SDM which says: "If software disarms the
> timer or postpones the deadline, race conditions may result in the
> delivery of a spurious timer interrupt.” It never says interrupts may
> be lost if you reprogram the deadline before you check it expired.

But the case when you rewrite the same value to the MSR is neither
disarming nor postponing.  You would be getting two interrupts for the
same event.  That is why I agree with Radim that checking host_initiated
is wrong.

Paolo


Re: [PATCH v2 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2

2014-10-10 Thread Christoffer Dall
On Thu, Oct 09, 2014 at 02:36:26PM +0100, Catalin Marinas wrote:
> On Thu, Oct 09, 2014 at 12:01:37PM +0100, Christoffer Dall wrote:
> > On Wed, Oct 08, 2014 at 10:47:04AM +0100, Catalin Marinas wrote:
> > > On Tue, Oct 07, 2014 at 08:39:54PM +0100, Christoffer Dall wrote:
> > > > +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
> > > > +{
> > > > +   pud_t *pud;
> > > > +   pmd_t *pmd;
> > > > +   unsigned int order, i;
> > > > +   unsigned long hwpgd;
> > > > +
> > > > +   if (KVM_PREALLOC_LEVEL == 0)
> > > > +   return 0;
> > > > +
> > > > +   order = get_order(PTRS_PER_S2_PGD);
> > > 
> > > Isn't order always 0 here? Based on our IRC discussion, PTRS_PER_S2_PGD
> > > is 16 or less and the order should not be used.
> > 
> > no, if the kernel has 4K pages and 4 levels, then PGDIR_SHIFT is 39, and
> > KVM_PHYS_SHIFT stays 40, so that means PTRS_PER_S2_PGD becomes 2, which
> > means we concatenate two first level stage-2 page tables, which means we
> > need to allocate two consecutive pages, giving us an order of 1, not 0.
> 
> So if PTRS_PER_S2_PGD is 2, how come get_order(PTRS_PER_S2_PGD) == 1? My
> reading of the get_order() macro is that get_order(2) == 0.
> 
> Did you mean get_order(PTRS_PER_S2_PGD * PAGE_SIZE)?

Ah, you're right.  Sorry.  Yes, that's what I meant.
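The distinction is easy to check with a user-space model of get_order() (a 4 KB PAGE_SIZE is assumed here):

```c
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Model of the kernel's get_order(): size in bytes -> allocation
 * order, i.e. log2 of the page count rounded up to a power of two. */
static unsigned int get_order_model(unsigned long size)
{
    unsigned long pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}
```

get_order_model(2) is 0, while get_order_model(2 * PAGE_SIZE) is 1, which is why the multiplication by PAGE_SIZE (or a direct shift count) matters.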

> 
> Or you could define a PTRS_PER_S2_PGD_SHIFT as (KVM_PHYS_SHIFT -
> PGDIR_SHIFT) and use this as the order directly.
> 

That's better.  I also experimented with defining S2_HWPGD_ORDER or
S2_PREALLOC_ORDER, but it didn't look much clearer, so I'm sticking with
PTRS_PER_S2_PGD_SHIFT.

> > > > +   hwpgd = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> > > 
> > > I assume you need __get_free_pages() for alignment.
> > 
> > yes, would you prefer a comment to that fact?
> 
> No, that's fine.
> 

Thanks,
-Christoffer