date:20151015

Re: sanitizing kvmtool

2015-10-15 Thread Dmitry Vyukov

removed bad email address

On Thu, Oct 15, 2015 at 12:21 PM, Dmitry Vyukov  wrote:
> Hello,
>
> I've run a set of sanitizers on
> git://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git commit
> 3695adeb227813d96d9c41850703fb53a23845eb. I've just booted a VM and
> shut it down.
>
> AddressSanitizer detected a heap-use-after-free:
>
> AddressSanitizer: heap-use-after-free on address 0x6040df90 at pc
> 0x004f46d0 bp 0x7ffc79def2d0 sp 0x7ffc79def2c8
> READ of size 8 at 0x6040df90 thread T0
> #0 0x4f46cf in kvm__pause kvm.c:436:7
> #1 0x4f0d5d in ioport__unregister ioport.c:129:2
> #2 0x4efb2f in serial8250__exit hw/serial.c:446:7
> #3 0x516204 in init_list__exit util/init.c:59:8
> #4 0x4ea956 in kvm_cmd_run_exit builtin-run.c:645:2
> #5 0x4ea956 in kvm_cmd_run builtin-run.c:661
> #6 0x51596f in handle_command kvm-cmd.c:84:8
> #7 0x7fa398101ec4 in __libc_start_main
> /build/buildd/eglibc-2.19/csu/libc-start.c:287
> #8 0x41a505 in _start (lkvm+0x41a505)
>
> 0x6040df90 is located 0 bytes inside of 40-byte region
> [0x6040df90,0x6040dfb8)
> freed by thread T0 here:
> #0 0x4b75a0 in free
> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cc:30
> #1 0x4f29e6 in kvm_cpu__exit kvm-cpu.c:263:2
> #2 0x5160c4 in init_list__exit util/init.c:59:8
>
> previously allocated by thread T0 here:
> #0 0x4b7a2c in calloc
> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cc:56
> #1 0x4f2491 in kvm_cpu__init kvm-cpu.c:221:14
>
>
> ThreadSanitizer detected a whole bunch of data races, for example:
>
> WARNING: ThreadSanitizer: data race (pid=109228)
>   Write of size 1 at 0x7d6c0001f384 by thread T55:
> #0 memcpy 
> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:608
> (lkvm+0x0044b28b)
> #1 virtio_pci__msix_mmio_callback virtio/pci.c:269:3 (lkvm+0x004b3ee0)
> #2 kvm__emulate_mmio mmio.c:131:3 (lkvm+0x004ac332)
> #3 kvm_cpu__emulate_mmio x86/include/kvm/kvm-cpu-arch.h:46:9
> (lkvm+0x004aa8c6)
> #4 kvm_cpu__start kvm-cpu.c:147 (lkvm+0x004aa8c6)
> #5 kvm_cpu_thread builtin-run.c:174:6 (lkvm+0x004a6e3e)
>
>   Previous read of size 4 at 0x7d6c0001f384 by thread T58:
> #0 virtio_pci__signal_vq virtio/pci.c:290:29 (lkvm+0x004b36b6)
> #1 virtio_net_tx_thread virtio/net.c:210:4 (lkvm+0x004b1fb5)
>
>   Location is heap block of size 1648 at 0x7d6c0001f100 allocated by
> main thread:
> #0 calloc 
> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:544
> (lkvm+0x0043e812)
> #1 virtio_init virtio/core.c:191:12 (lkvm+0x004afa48)
> #2 virtio_net__init_one virtio/net.c:846:2 (lkvm+0x004b095d)
> #3 virtio_net__init virtio/net.c:868:3 (lkvm+0x004b0296)
> #4 init_list__init util/init.c:40:8 (lkvm+0x004bc7ee)
> #5 kvm_cmd_run_init builtin-run.c:621:6 (lkvm+0x004a6799)
> #6 kvm_cmd_run builtin-run.c:656 (lkvm+0x004a6799)
> #7 handle_command kvm-cmd.c:84:8 (lkvm+0x004bc40c)
> #8 handle_kvm_command main.c:11:9 (lkvm+0x004ac0b4)
> #9 main main.c:18 (lkvm+0x004ac0b4)
>
>   Thread T55 'kvm-vcpu-2' (tid=109285, running) created by main thread at:
> #0 pthread_create
> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:848
> (lkvm+0x004478a3)
> #1 kvm_cmd_run_work builtin-run.c:633:7 (lkvm+0x004a683f)
> #2 kvm_cmd_run builtin-run.c:660 (lkvm+0x004a683f)
> #3 handle_command kvm-cmd.c:84:8 (lkvm+0x004bc40c)
> #4 handle_kvm_command main.c:11:9 (lkvm+0x004ac0b4)
> #5 main main.c:18 (lkvm+0x004ac0b4)
>
>   Thread T58 'virtio-net-tx' (tid=109334, running) created by thread T53 at:
> #0 pthread_create
> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:848
> (lkvm+0x004478a3)
> #1 init_vq virtio/net.c:526:4 (lkvm+0x004b1523)
> #2 virtio_pci__io_out virtio/pci.c:219:3 (lkvm+0x004b484c)
> #3 kvm__emulate_io ioport.c:196:11 (lkvm+0x004aa0f8)
> #4 virtio_pci__io_mmio_callback virtio/pci.c:340:2 (lkvm+0x004b3e55)
> #5 kvm__emulate_mmio mmio.c:131:3 (lkvm+0x004ac332)
> #6 kvm_cpu__emulate_mmio x86/include/kvm/kvm-cpu-arch.h:46:9
> (lkvm+0x004aa8c6)
> #7 kvm_cpu__start kvm-cpu.c:147 (lkvm+0x004aa8c6)
> #8 kvm_cpu_thread builtin-run.c:174:6 (lkvm+0x004a6e3e)
>
>
> and mutex unlock in a wrong thread (not supported by pthread):
>
> WARNING: ThreadSanitizer: unlock of an unlocked mutex (or by a wrong
> thread) (pid=109228)
> #0 pthread_mutex_unlock
> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:3303
>

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Christoffer Dall

On Thu, Oct 15, 2015 at 01:21:55PM +0200, Arnd Bergmann wrote:
> On Thursday 15 October 2015 10:08:02 Eric Auger wrote:
> > Hi Arnd,
> > On 10/14/2015 05:38 PM, Arnd Bergmann wrote:
> > > On Wednesday 14 October 2015 15:33:12 Eric Auger wrote:
> > >> --- a/drivers/vfio/platform/vfio_platform_common.c
> > >> +++ b/drivers/vfio/platform/vfio_platform_common.c
> > >> @@ -31,6 +31,11 @@ static const struct vfio_platform_reset_combo 
> > >> reset_lookup_table[] = {
> > >> .reset_function_name = 
> > >> "vfio_platform_calxedaxgmac_reset",
> > >> .module_name = "vfio-platform-calxedaxgmac",
> > >> },
> > >> +   {
> > >> +   .compat = "amd,xgbe-seattle-v1a",
> > >> +   .reset_function_name = "vfio_platform_amdxgbe_reset",
> > >> +   .module_name = "vfio-platform-amdxgbe",
> > >> +   },
> > >>  };
> > >>  
> > >>  static void vfio_platform_get_reset(struct vfio_platform_device *vdev,
> > >>
> > > 
> > > This is causing build errors for me when CONFIG_MODULES is disabled.
> > Sorry about that and thanks for reporting the issue
> > > 
> > > Could this please be restructured so vfio_platform_get_reset does
> > > not attempt to call __symbol_get() but instead has the drivers
> > > register themselves properly to a subsystem?
> > OK
> > 
> > Could you elaborate about "has the drivers register themselves properly
> > to a subsystem".
> > 
> > My first proposal when coping with this problematic of being able to add
> > reset plugins to the vfio-platform driver was to create new drivers per
> > device requiring reset. but this was considered painful for end-users,
> > who needed to be aware of the right driver to bind - and I think that
> > makes sense - (https://lkml.org/lkml/2015/4/17/568) .
> 
> Having multiple drivers indeed sucks, but your current approach isn't
> that much better, as you still have two modules that are used to driver
> the same hardware.
> 
> I would expect that the same driver that is used for the normal
> operation and that it calls a function to register itself to vfio
> by passing a structure with the device and reset function pointer.
> 
> > A naive question I dare to ask, wouldn't it be acceptable to make
> > vfio_platform depend on CONFIG_MODULES? Don't we disable modules for
> > security purpose? In that context would we use VFIO?
> 
> I think a lot of embedded systems turn off modules to save a little
> memory, speed up boot time and simplify their user space.
> 
> Aside from that, the current method is highly unusual and looks a bit
> fragile to me, as you are relying on internals of the module loader
> code. It's also a layering violation as the generic code needs to be
> patched for each device specific module that is added, and we try
> to avoid that.
> 
> A possible solution could be something inside the xgbe driver like
> 
> 
> static void xgbe_init_module(void)
> {
>   int ret = 0;
> 
>   if (IS_ENABLED(CONFIG_AMD_XGBE_ETHERNET)
>   ret = platform_driver_register(_driver);
>   if (ret)
>   return ret;
> 
>   if (IS_ENABLED(CONFIG_VFIO_PLATFORM))
>   ret = vfio_platform_register_reset(_of_match, 
> xgbe_platform_reset);
> 
>   return ret; 
> }
> 
> This way you have exactly one driver module that gets loaded for the
> device and you can use it either with the platform_driver or through
> vfio.
> 
> A nicer way that would be a little more work would be to integrate
> the reset infrastructure into 'struct platform_driver' framework,
> by adding another callback to the it for doing the interaction with
> vfio, something like
> 
> enum vfio_platform_op {
>   VFIO_PLATFORM_BIND,
>   VFIO_PLATFORM_UNBIND,
>   VFIO_PLATFORM_RESET,
> };
> 
> struct platform_driver {
> int (*probe)(struct platform_device *);
> int (*remove)(struct platform_device *);
>   ...
>   int (*vfio_manage)(struct platform_device *, enum vfio_platform_op);
> struct device_driver driver;
> };
> 
> This would integrate much more closely into the platform driver framework,
> just like the regular vfio driver integrates into the PCI framework.
> Unlike PCI however, you can't just use the generic driver framework to
> unbind the driver, because you still need device specific code.
> 
Thanks for these suggestions, really helpful.

What I don't understand in the latter example is how VFIO knows which
struct platform_driver to interact with?

Also, just so I'm sure I understand correctly, VFIO_PLATFORM_UNBIND is
then called by VFIO before the VFIO driver unbinds from the device
(unbinding the platform driver from the device being a completely
separate thing)?

Thanks,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/5] KVM: x86: MMU: Make force_pt_level bool

2015-10-15 Thread Takuya Yoshikawa

This will be passed to a function later.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/kvm/mmu.c |8 
 arch/x86/kvm/paging_tmpl.h |4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b8482c0..2262728 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,7 +2962,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, 
u32 error_code,
 {
int r;
int level;
-   int force_pt_level;
+   bool force_pt_level;
pfn_t pfn;
unsigned long mmu_seq;
bool map_writable, write = error_code & PFERR_WRITE_MASK;
@@ -3476,7 +3476,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
gpa, u32 error_code,
pfn_t pfn;
int r;
int level;
-   int force_pt_level;
+   bool force_pt_level;
gfn_t gfn = gpa >> PAGE_SHIFT;
unsigned long mmu_seq;
int write = error_code & PFERR_WRITE_MASK;
@@ -3497,9 +3497,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
gpa, u32 error_code,
 
if (mapping_level_dirty_bitmap(vcpu, gfn) ||
!check_hugepage_cache_consistency(vcpu, gfn, PT_DIRECTORY_LEVEL))
-   force_pt_level = 1;
+   force_pt_level = true;
else
-   force_pt_level = 0;
+   force_pt_level = false;
 
if (likely(!force_pt_level)) {
level = mapping_level(vcpu, gfn);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 736e6ab..07f1a4e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -698,7 +698,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
int r;
pfn_t pfn;
int level = PT_PAGE_TABLE_LEVEL;
-   int force_pt_level;
+   bool force_pt_level;
unsigned long mmu_seq;
bool map_writable, is_self_change_mapping;
 
@@ -747,7 +747,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn)
   || is_self_change_mapping;
else
-   force_pt_level = 1;
+   force_pt_level = true;
if (!force_pt_level) {
level = min(walker.level, mapping_level(vcpu, walker.gfn));
walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/5] KVM: x86: MMU: Eliminate extra memory slot searches in page fault handlers

2015-10-15 Thread Takuya Yoshikawa

In page fault handlers, both mapping_level_dirty_bitmap() and mapping_level()
do a memory slot search, binary search, through kvm_vcpu_gfn_to_memslot(), which
may not be negligible especially for virtual machines with many memory slots.

With a bit of cleanup effort, the patch set reduces this overhead.

  [PATCH 1/5] KVM: x86: MMU: Make force_pt_level bool
  [PATCH 2/5] KVM: x86: MMU: Simplify force_pt_level calculation code in 
FNAME(page_fault)()
  [PATCH 3/5] KVM: x86: MMU: Merge mapping_level_dirty_bitmap() into 
mapping_level()
  [PATCH 4/5] KVM: x86: MMU: Remove mapping_level_dirty_bitmap()
  [PATCH 5/5] KVM: x86: MMU: Eliminate an extra memory slot search in 
mapping_level()

Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/5] KVM: x86: MMU: Simplify force_pt_level calculation code in FNAME(page_fault)()

2015-10-15 Thread Takuya Yoshikawa

As a bonus, an extra memory slot search can be eliminated when
is_self_change_mapping is true.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/kvm/paging_tmpl.h |   15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 07f1a4e..8ebc3a5 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -743,15 +743,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu,
  , user_fault, >arch.write_fault_to_shadow_pgtable);
 
-   if (walker.level >= PT_DIRECTORY_LEVEL)
-   force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn)
-  || is_self_change_mapping;
-   else
+   if (walker.level >= PT_DIRECTORY_LEVEL && !is_self_change_mapping) {
+   force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn);
+   if (!force_pt_level) {
+   level = min(walker.level, mapping_level(vcpu, 
walker.gfn));
+   walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) 
- 1);
+   }
+   } else
force_pt_level = true;
-   if (!force_pt_level) {
-   level = min(walker.level, mapping_level(vcpu, walker.gfn));
-   walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
-   }
 
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/5] KVM: x86: MMU: Merge mapping_level_dirty_bitmap() into mapping_level()

2015-10-15 Thread Takuya Yoshikawa

This is necessary to eliminate an extra memory slot search later.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/kvm/mmu.c |   29 ++---
 arch/x86/kvm/paging_tmpl.h |6 +++---
 2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2262728..890cd69 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -870,10 +870,16 @@ static bool mapping_level_dirty_bitmap(struct kvm_vcpu 
*vcpu, gfn_t large_gfn)
return !gfn_to_memslot_dirty_bitmap(vcpu, large_gfn, true);
 }
 
-static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
+static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
+bool *force_pt_level)
 {
int host_level, level, max_level;
 
+   if (likely(!*force_pt_level))
+   *force_pt_level = mapping_level_dirty_bitmap(vcpu, large_gfn);
+   if (unlikely(*force_pt_level))
+   return PT_PAGE_TABLE_LEVEL;
+
host_level = host_mapping_level(vcpu->kvm, large_gfn);
 
if (host_level == PT_PAGE_TABLE_LEVEL)
@@ -2962,14 +2968,13 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t 
v, u32 error_code,
 {
int r;
int level;
-   bool force_pt_level;
+   bool force_pt_level = false;
pfn_t pfn;
unsigned long mmu_seq;
bool map_writable, write = error_code & PFERR_WRITE_MASK;
 
-   force_pt_level = mapping_level_dirty_bitmap(vcpu, gfn);
+   level = mapping_level(vcpu, gfn, _pt_level);
if (likely(!force_pt_level)) {
-   level = mapping_level(vcpu, gfn);
/*
 * This path builds a PAE pagetable - so we can map
 * 2mb pages at maximum. Therefore check if the level
@@ -2979,8 +2984,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, 
u32 error_code,
level = PT_DIRECTORY_LEVEL;
 
gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-   } else
-   level = PT_PAGE_TABLE_LEVEL;
+   }
 
if (fast_page_fault(vcpu, v, level, error_code))
return 0;
@@ -3495,20 +3499,15 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
gpa, u32 error_code,
if (r)
return r;
 
-   if (mapping_level_dirty_bitmap(vcpu, gfn) ||
-   !check_hugepage_cache_consistency(vcpu, gfn, PT_DIRECTORY_LEVEL))
-   force_pt_level = true;
-   else
-   force_pt_level = false;
-
+   force_pt_level = !check_hugepage_cache_consistency(vcpu, gfn,
+  PT_DIRECTORY_LEVEL);
+   level = mapping_level(vcpu, gfn, _pt_level);
if (likely(!force_pt_level)) {
-   level = mapping_level(vcpu, gfn);
if (level > PT_DIRECTORY_LEVEL &&
!check_hugepage_cache_consistency(vcpu, gfn, level))
level = PT_DIRECTORY_LEVEL;
gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-   } else
-   level = PT_PAGE_TABLE_LEVEL;
+   }
 
if (fast_page_fault(vcpu, gpa, level, error_code))
return 0;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 8ebc3a5..bf39d0f 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -744,9 +744,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
  , user_fault, >arch.write_fault_to_shadow_pgtable);
 
if (walker.level >= PT_DIRECTORY_LEVEL && !is_self_change_mapping) {
-   force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn);
-   if (!force_pt_level) {
-   level = min(walker.level, mapping_level(vcpu, 
walker.gfn));
+   level = mapping_level(vcpu, walker.gfn, _pt_level);
+   if (likely(!force_pt_level)) {
+   level = min(walker.level, level);
walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) 
- 1);
}
} else
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/5] KVM: x86: MMU: Remove mapping_level_dirty_bitmap()

2015-10-15 Thread Takuya Yoshikawa

Now that it has only one caller, and its name is not so helpful for
readers, just remove it.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/kvm/mmu.c |   21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 890cd69..78a3d08 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -851,6 +851,14 @@ static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
return ret;
 }
 
+static inline bool memslot_invalid(struct kvm_memory_slot *slot)
+{
+   if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
+   return true;
+
+   return false;
+}
+
 static struct kvm_memory_slot *
 gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
bool no_dirty_log)
@@ -858,25 +866,22 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t 
gfn,
struct kvm_memory_slot *slot;
 
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-   if (!slot || slot->flags & KVM_MEMSLOT_INVALID ||
- (no_dirty_log && slot->dirty_bitmap))
+   if (memslot_invalid(slot) || (no_dirty_log && slot->dirty_bitmap))
slot = NULL;
 
return slot;
 }
 
-static bool mapping_level_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t large_gfn)
-{
-   return !gfn_to_memslot_dirty_bitmap(vcpu, large_gfn, true);
-}
-
 static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
 bool *force_pt_level)
 {
int host_level, level, max_level;
+   struct kvm_memory_slot *slot;
+
+   slot = kvm_vcpu_gfn_to_memslot(vcpu, large_gfn);
 
if (likely(!*force_pt_level))
-   *force_pt_level = mapping_level_dirty_bitmap(vcpu, large_gfn);
+   *force_pt_level = memslot_invalid(slot) || slot->dirty_bitmap;
if (unlikely(*force_pt_level))
return PT_PAGE_TABLE_LEVEL;
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Arnd Bergmann

On Thursday 15 October 2015 10:08:02 Eric Auger wrote:
> Hi Arnd,
> On 10/14/2015 05:38 PM, Arnd Bergmann wrote:
> > On Wednesday 14 October 2015 15:33:12 Eric Auger wrote:
> >> --- a/drivers/vfio/platform/vfio_platform_common.c
> >> +++ b/drivers/vfio/platform/vfio_platform_common.c
> >> @@ -31,6 +31,11 @@ static const struct vfio_platform_reset_combo 
> >> reset_lookup_table[] = {
> >> .reset_function_name = "vfio_platform_calxedaxgmac_reset",
> >> .module_name = "vfio-platform-calxedaxgmac",
> >> },
> >> +   {
> >> +   .compat = "amd,xgbe-seattle-v1a",
> >> +   .reset_function_name = "vfio_platform_amdxgbe_reset",
> >> +   .module_name = "vfio-platform-amdxgbe",
> >> +   },
> >>  };
> >>  
> >>  static void vfio_platform_get_reset(struct vfio_platform_device *vdev,
> >>
> > 
> > This is causing build errors for me when CONFIG_MODULES is disabled.
> Sorry about that and thanks for reporting the issue
> > 
> > Could this please be restructured so vfio_platform_get_reset does
> > not attempt to call __symbol_get() but instead has the drivers
> > register themselves properly to a subsystem?
> OK
> 
> Could you elaborate about "has the drivers register themselves properly
> to a subsystem".
> 
> My first proposal when coping with this problematic of being able to add
> reset plugins to the vfio-platform driver was to create new drivers per
> device requiring reset. but this was considered painful for end-users,
> who needed to be aware of the right driver to bind - and I think that
> makes sense - (https://lkml.org/lkml/2015/4/17/568) .

Having multiple drivers indeed sucks, but your current approach isn't
that much better, as you still have two modules that are used to driver
the same hardware.

I would expect that the same driver that is used for the normal
operation and that it calls a function to register itself to vfio
by passing a structure with the device and reset function pointer.

> A naive question I dare to ask, wouldn't it be acceptable to make
> vfio_platform depend on CONFIG_MODULES? Don't we disable modules for
> security purpose? In that context would we use VFIO?

I think a lot of embedded systems turn off modules to save a little
memory, speed up boot time and simplify their user space.

Aside from that, the current method is highly unusual and looks a bit
fragile to me, as you are relying on internals of the module loader
code. It's also a layering violation as the generic code needs to be
patched for each device specific module that is added, and we try
to avoid that.

A possible solution could be something inside the xgbe driver like

static void xgbe_init_module(void)
{
int ret = 0;

if (IS_ENABLED(CONFIG_AMD_XGBE_ETHERNET)
ret = platform_driver_register(_driver);
if (ret)
return ret;

if (IS_ENABLED(CONFIG_VFIO_PLATFORM))
ret = vfio_platform_register_reset(_of_match, 
xgbe_platform_reset);

return ret; 
}

This way you have exactly one driver module that gets loaded for the
device and you can use it either with the platform_driver or through
vfio.

A nicer way that would be a little more work would be to integrate
the reset infrastructure into 'struct platform_driver' framework,
by adding another callback to the it for doing the interaction with
vfio, something like

enum vfio_platform_op {
VFIO_PLATFORM_BIND,
VFIO_PLATFORM_UNBIND,
VFIO_PLATFORM_RESET,
};

struct platform_driver {
int (*probe)(struct platform_device *);
int (*remove)(struct platform_device *);
...
int (*vfio_manage)(struct platform_device *, enum vfio_platform_op);
struct device_driver driver;
};

This would integrate much more closely into the platform driver framework,
just like the regular vfio driver integrates into the PCI framework.
Unlike PCI however, you can't just use the generic driver framework to
unbind the driver, because you still need device specific code.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 5/5] KVM: x86: MMU: Eliminate an extra memory slot search in mapping_level()

2015-10-15 Thread Takuya Yoshikawa

Calling kvm_vcpu_gfn_to_memslot() twice in mapping_level() should be
avoided since getting a slot by binary search may not be negligible,
especially for virtual machines with many memory slots.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/kvm/mmu.c |   17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 78a3d08..8d285dc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -818,14 +818,11 @@ static void unaccount_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
kvm->arch.indirect_shadow_pages--;
 }
 
-static int has_wrprotected_page(struct kvm_vcpu *vcpu,
-   gfn_t gfn,
-   int level)
+static int __has_wrprotected_page(gfn_t gfn, int level,
+ struct kvm_memory_slot *slot)
 {
-   struct kvm_memory_slot *slot;
struct kvm_lpage_info *linfo;
 
-   slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
if (slot) {
linfo = lpage_info_slot(gfn, slot, level);
return linfo->write_count;
@@ -834,6 +831,14 @@ static int has_wrprotected_page(struct kvm_vcpu *vcpu,
return 1;
 }
 
+static int has_wrprotected_page(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
+{
+   struct kvm_memory_slot *slot;
+
+   slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+   return __has_wrprotected_page(gfn, level, slot);
+}
+
 static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
 {
unsigned long page_size;
@@ -893,7 +898,7 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn,
max_level = min(kvm_x86_ops->get_lpage_level(), host_level);
 
for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level)
-   if (has_wrprotected_page(vcpu, large_gfn, level))
+   if (__has_wrprotected_page(large_gfn, level, slot))
break;
 
return level - 1;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 0/4] KVM: arm/arm64: guest synchronous halt/resume

2015-10-15 Thread Christoffer Dall

On Fri, Sep 25, 2015 at 11:41:13PM +0200, Eric Auger wrote:
> This series introduces the capability to synchronously exit the guest
> and prevent it from being re-entered. This modality will be used by
> IRQ forwarding series when changing the state of the IRQ.
> 
> Former pause flag used when starting the vcpu in KVM_ARM_VCPU_POWER_OFF
> state, in PSCI calls and in KVM_SET_MP_STATE ioctl is renamed into
> power_off. A new pause flag is introduced. Both now are checked in
> kvm_arch_vcpu_runnable and in the VCPU_RUN critical section, before
> entering the vcpu.
> 
Thanks - applied,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Janusz

W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:
>
>
>
> Well, the bug may be not in KVM. When this bug happened, i saw OVMF
> only checked 1 CPU out, there is the log from OVMF's debug input:
>
>   Flushing GCD
>   Flushing GCD
>   Flushing GCD
>   Flushing GCD
>   Flushing GCD
>   Flushing GCD
>   Flushing GCD
>   Flushing GCD
>   Flushing GCD
>   Flushing GCDs
> Detect CPU count: 1
>
> So that the startup code has been freed however the APs are still
> running,
> i think that why we saw the vCPUs executed on unexpected address.
>
> After digging into OVMF's code, i noticed that BSP CPU waits for APs
> for a fixed timer period, however, KVM recent changes require zap all
> mappings if CR0.CD is changed, that means the APs need more time to
> startup.
>
> After following changes to OVMF, the bug is completely gone on my side:
>
> --- a/UefiCpuPkg/CpuDxe/ApStartup.c
> +++ b/UefiCpuPkg/CpuDxe/ApStartup.c
> @@ -454,7 +454,9 @@ StartApsStackless (
>//
>// Wait 100 milliseconds for APs to arrive at the ApEntryPoint routine
>//
> -  MicroSecondDelay (100 * 1000);
> +  MicroSecondDelay (10 * 100 * 1000);
>
>return EFI_SUCCESS;
>  }
>
> Janusz, could you please check this instead? You can switch to your
> previous kernel to do this test.
>
>
Ok, now first time when I started VM I was able to start system
successfully. When I turned it off and started it again, it restarted my
vm at system boot couple of times. Sometimes I also get very high cpu
usage for no reason. Also, I get less fps in GTA 5 than in kernel 4.1, I
get something like 30-55, but on 4.1 I get all the time 60 fps. This is
my new log: https://bpaste.net/show/61a122ad7fe5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Xiao Guangrong




On 10/15/2015 02:58 PM, Janusz wrote:

W dniu 15.10.2015 o 08:41, Xiao Guangrong pisze:



On 10/15/2015 02:19 PM, Janusz wrote:

W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:




Well, the bug may be not in KVM. When this bug happened, i saw OVMF
only checked 1 CPU out, there is the log from OVMF's debug input:

Flushing GCD
Flushing GCD
Flushing GCD
Flushing GCD
Flushing GCD
Flushing GCD
Flushing GCD
Flushing GCD
Flushing GCD
Flushing GCDs
Detect CPU count: 1

So that the startup code has been freed however the APs are still
running,
i think that why we saw the vCPUs executed on unexpected address.

After digging into OVMF's code, i noticed that BSP CPU waits for APs
for a fixed timer period, however, KVM recent changes require zap all
mappings if CR0.CD is changed, that means the APs need more time to
startup.

After following changes to OVMF, the bug is completely gone on my side:

--- a/UefiCpuPkg/CpuDxe/ApStartup.c
+++ b/UefiCpuPkg/CpuDxe/ApStartup.c
@@ -454,7 +454,9 @@ StartApsStackless (
 //
 // Wait 100 milliseconds for APs to arrive at the ApEntryPoint
routine
 //
-  MicroSecondDelay (100 * 1000);
+  MicroSecondDelay (10 * 100 * 1000);

 return EFI_SUCCESS;
   }

Janusz, could you please check this instead? You can switch to your
previous kernel to do this test.



Ok, now first time when I started VM I was able to start system
successfully. When I turned it off and started it again, it restarted my
vm at system boot couple of times. Sometimes I also get very high cpu
usage for no reason. Also, I get less fps in GTA 5 than in kernel 4.1, I
get something like 30-55, but on 4.1 I get all the time 60 fps. This is
my new log: https://bpaste.net/show/61a122ad7fe5



Just confirm: the Qemu internal error did not appear any more, right?

Yes, when I reverted your first patch, switched to -vga std from -vga
none and didn't passthrough my GPU (case when I got this internal
error), vm started without problem. I even didn't get any VM restarts
like with passthrough



Wow, it seems we have fixed the QEMU internal error now. :)

Recurrently, Paolo has reverted some MTRR patches, was your test
based on these reverted patches?

The GPU passthrough issue may be related to vfio (not sure), Alex, do
you have any idea?

Laszlo, could you please check the root case is reasonable and fix it in
OVMF if it's right?

BTW, OVMF handles #UD with no trace - nothing is killed, and no call trace
in the debug input...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Xiao Guangrong




On 10/15/2015 02:19 PM, Janusz wrote:

W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:




Well, the bug may be not in KVM. When this bug happened, i saw OVMF
only checked 1 CPU out, there is the log from OVMF's debug input:

   Flushing GCD
   Flushing GCD
   Flushing GCD
   Flushing GCD
   Flushing GCD
   Flushing GCD
   Flushing GCD
   Flushing GCD
   Flushing GCD
   Flushing GCDs
Detect CPU count: 1

So that the startup code has been freed however the APs are still
running,
i think that why we saw the vCPUs executed on unexpected address.

After digging into OVMF's code, i noticed that BSP CPU waits for APs
for a fixed timer period, however, KVM recent changes require zap all
mappings if CR0.CD is changed, that means the APs need more time to
startup.

After following changes to OVMF, the bug is completely gone on my side:

--- a/UefiCpuPkg/CpuDxe/ApStartup.c
+++ b/UefiCpuPkg/CpuDxe/ApStartup.c
@@ -454,7 +454,9 @@ StartApsStackless (
//
// Wait 100 milliseconds for APs to arrive at the ApEntryPoint routine
//
-  MicroSecondDelay (100 * 1000);
+  MicroSecondDelay (10 * 100 * 1000);

return EFI_SUCCESS;
  }

Janusz, could you please check this instead? You can switch to your
previous kernel to do this test.



Ok, now first time when I started VM I was able to start system
successfully. When I turned it off and started it again, it restarted my
vm at system boot couple of times. Sometimes I also get very high cpu
usage for no reason. Also, I get less fps in GTA 5 than in kernel 4.1, I
get something like 30-55, but on 4.1 I get all the time 60 fps. This is
my new log: https://bpaste.net/show/61a122ad7fe5



Just confirm: the Qemu internal error did not appear any more, right?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Janusz

W dniu 15.10.2015 o 09:10, Xiao Guangrong pisze:
>
>
> On 10/15/2015 02:58 PM, Janusz wrote:
>> W dniu 15.10.2015 o 08:41, Xiao Guangrong pisze:
>>>
>>>
>>> On 10/15/2015 02:19 PM, Janusz wrote:
 W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:
>
>
>
> Well, the bug may be not in KVM. When this bug happened, i saw OVMF
> only checked 1 CPU out, there is the log from OVMF's debug input:
>
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCDs
> Detect CPU count: 1
>
> So that the startup code has been freed however the APs are still
> running,
> i think that why we saw the vCPUs executed on unexpected address.
>
> After digging into OVMF's code, i noticed that BSP CPU waits for APs
> for a fixed timer period, however, KVM recent changes require zap all
> mappings if CR0.CD is changed, that means the APs need more time to
> startup.
>
> After following changes to OVMF, the bug is completely gone on my
> side:
>
> --- a/UefiCpuPkg/CpuDxe/ApStartup.c
> +++ b/UefiCpuPkg/CpuDxe/ApStartup.c
> @@ -454,7 +454,9 @@ StartApsStackless (
>  //
>  // Wait 100 milliseconds for APs to arrive at the ApEntryPoint
> routine
>  //
> -  MicroSecondDelay (100 * 1000);
> +  MicroSecondDelay (10 * 100 * 1000);
>
>  return EFI_SUCCESS;
>}
>
> Janusz, could you please check this instead? You can switch to your
> previous kernel to do this test.
>
>
 Ok, now first time when I started VM I was able to start system
 successfully. When I turned it off and started it again, it
 restarted my
 vm at system boot couple of times. Sometimes I also get very high cpu
 usage for no reason. Also, I get less fps in GTA 5 than in kernel
 4.1, I
 get something like 30-55, but on 4.1 I get all the time 60 fps.
 This is
 my new log: https://bpaste.net/show/61a122ad7fe5

>>>
>>> Just confirm: the Qemu internal error did not appear any more, right?
>> Yes, when I reverted your first patch, switched to -vga std from -vga
>> none and didn't passthrough my GPU (case when I got this internal
>> error), vm started without problem. I even didn't get any VM restarts
>> like with passthrough
>>
>
> Wow, it seems we have fixed the QEMU internal error now. :)
>
> Recurrently, Paolo has reverted some MTRR patches, was your test
> based on these reverted patches?
>
> The GPU passthrough issue may be related to vfio (not sure), Alex, do
> you have any idea?
>
> Laszlo, could you please check the root case is reasonable and fix it in
> OVMF if it's right?
>
> BTW, OVMF handles #UD with no trace - nothing is killed, and no call
> trace
> in the debug input...
>
Yes, reverted MTRR code is already in kernel I use - 4.3-r5+
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Eric Auger

Hi Arnd,
On 10/14/2015 05:38 PM, Arnd Bergmann wrote:
> On Wednesday 14 October 2015 15:33:12 Eric Auger wrote:
>> --- a/drivers/vfio/platform/vfio_platform_common.c
>> +++ b/drivers/vfio/platform/vfio_platform_common.c
>> @@ -31,6 +31,11 @@ static const struct vfio_platform_reset_combo 
>> reset_lookup_table[] = {
>> .reset_function_name = "vfio_platform_calxedaxgmac_reset",
>> .module_name = "vfio-platform-calxedaxgmac",
>> },
>> +   {
>> +   .compat = "amd,xgbe-seattle-v1a",
>> +   .reset_function_name = "vfio_platform_amdxgbe_reset",
>> +   .module_name = "vfio-platform-amdxgbe",
>> +   },
>>  };
>>  
>>  static void vfio_platform_get_reset(struct vfio_platform_device *vdev,
>>
> 
> This is causing build errors for me when CONFIG_MODULES is disabled.
Sorry about that and thanks for reporting the issue
> 
> Could this please be restructured so vfio_platform_get_reset does
> not attempt to call __symbol_get() but instead has the drivers
> register themselves properly to a subsystem?
OK

Could you elaborate about "has the drivers register themselves properly
to a subsystem".

My first proposal when coping with this problematic of being able to add
reset plugins to the vfio-platform driver was to create new drivers per
device requiring reset. but this was considered painful for end-users,
who needed to be aware of the right driver to bind - and I think that
makes sense - (https://lkml.org/lkml/2015/4/17/568) .

A naive question I dare to ask, wouldn't it be acceptable to make
vfio_platform depend on CONFIG_MODULES? Don't we disable modules for
security purpose? In that context would we use VFIO?

Best Regards

Eric
> 
> I don't see any way this could be fixed otherwise. The problem
> of course showed up with calxedaxgmac already, but I'd prefer not
> to see anything added there until the common code has been improved.
> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Janusz

W dniu 15.10.2015 o 08:41, Xiao Guangrong pisze:
>
>
> On 10/15/2015 02:19 PM, Janusz wrote:
>> W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:
>>>
>>>
>>>
>>> Well, the bug may be not in KVM. When this bug happened, i saw OVMF
>>> only checked 1 CPU out, there is the log from OVMF's debug input:
>>>
>>>Flushing GCD
>>>Flushing GCD
>>>Flushing GCD
>>>Flushing GCD
>>>Flushing GCD
>>>Flushing GCD
>>>Flushing GCD
>>>Flushing GCD
>>>Flushing GCD
>>>Flushing GCDs
>>> Detect CPU count: 1
>>>
>>> So that the startup code has been freed however the APs are still
>>> running,
>>> i think that why we saw the vCPUs executed on unexpected address.
>>>
>>> After digging into OVMF's code, i noticed that BSP CPU waits for APs
>>> for a fixed timer period, however, KVM recent changes require zap all
>>> mappings if CR0.CD is changed, that means the APs need more time to
>>> startup.
>>>
>>> After following changes to OVMF, the bug is completely gone on my side:
>>>
>>> --- a/UefiCpuPkg/CpuDxe/ApStartup.c
>>> +++ b/UefiCpuPkg/CpuDxe/ApStartup.c
>>> @@ -454,7 +454,9 @@ StartApsStackless (
>>> //
>>> // Wait 100 milliseconds for APs to arrive at the ApEntryPoint
>>> routine
>>> //
>>> -  MicroSecondDelay (100 * 1000);
>>> +  MicroSecondDelay (10 * 100 * 1000);
>>>
>>> return EFI_SUCCESS;
>>>   }
>>>
>>> Janusz, could you please check this instead? You can switch to your
>>> previous kernel to do this test.
>>>
>>>
>> Ok, now first time when I started VM I was able to start system
>> successfully. When I turned it off and started it again, it restarted my
>> vm at system boot couple of times. Sometimes I also get very high cpu
>> usage for no reason. Also, I get less fps in GTA 5 than in kernel 4.1, I
>> get something like 30-55, but on 4.1 I get all the time 60 fps. This is
>> my new log: https://bpaste.net/show/61a122ad7fe5
>>
>
> Just confirm: the Qemu internal error did not appear any more, right?
Yes, when I reverted your first patch, switched to -vga std from -vga
none and didn't passthrough my GPU (case when I got this internal
error), vm started without problem. I even didn't get any VM restarts
like with passthrough
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: VMX: enable LBR virtualization

2015-10-15 Thread Jian Zhou




On 2015/10/14 19:30, Paolo Bonzini wrote:



On 14/10/2015 13:26, Jian Zhou wrote:

On 12/10/2015 20:44, Paolo Bonzini wrote:

In addition, the MSR numbers may differ between the guest and the host,
because it is possible to emulate e.g. a Core CPU on a Core 2 CPU.  So I
recommend against using the atomic switch mechanism for the from/to MSRs.


   The vLBR feature depends on vPMU, and to enable vPMU, it needs to
   specify the "cpu mode" in the guest XML as host-passthrough. I think
   the MSR numbers between the guest and the host are the same in this
   senario.


Does it depend on vPMU _for Linux guests_ or in general?  My impression
is that LBR can be used by the guest independent of the PMU.


  I think only for Linux guests.

  I googled how to enable LBR on other guests(except Linux guests),
  e.g. Windows, and got no developer manuals about it.

  Here is an article about it:
  http://www.codeproject.com/Articles/517466/Last-branch-records-
  and-branch-tracing
  it says:
  "bit 8 of DR7 represents bit 0 of DebugCtl. This is the LBR bit."

  Intel developer manual vol 3B introduced DR7(Debug Control Register)
  and bit 8 of it on Section 17.2.4:
  "LE and GE (local and global exact breakpoint enable) flags (bits 8,
  9) — When set, these flags cause the processor to detect the exact
  instruction that caused a data breakpoint condition. For backward and
  forward compatibility with other Intel processors, we recommend that
  the LE and GE flags be set to 1 if exact breakpoints are required."

  But for now, I don't know how to test bit 8 of DR7 on Windows.

  Regards,

  Jian







--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Arnd Bergmann

On Thursday 15 October 2015 17:03:21 Christoffer Dall wrote:
> On Thu, Oct 15, 2015 at 04:55:13PM +0200, Arnd Bergmann wrote:
> > On Thursday 15 October 2015 16:46:09 Eric Auger wrote:
> > > > 
> > > > This is where we'd need a little more changes for this approach. Instead
> > > > of unbinding the device from its driver, the idea would be that the
> > > > driver remains bound as far as the driver model is concerned, but
> > > > it would be in a quiescent state where no other subsystem interacts with
> > > > it (i.e. it gets unregistered from networking core or whichever it 
> > > > uses).
> > > 
> > > Currently we use the same mechanism as for PCI, ie. unbind the native
> > > driver and then bind VFIO platform driver in its place. Don't you think
> > > changing this may be a pain for user-space tools that are designed to
> > > work that way for PCI?
> > >
> > > My personal preference would be to start with your first proposal since
> > > it looks (to me) less complex and "unknown" that the 2d approach.
> > 
> > We certainly can't easily change from one approach to the other without
> > breaking user expectations, so the decision needs to be made carefully.
> > 
> > The main observation here is that platform devices are unlike PCI in this
> > regard because they need extra per-device code. I have argued in the
> > past that we should not reuse the "VFIO" name here because it's actually
> > something else.  
> 
> I've adjusted to consider VFIO a general purpose framework for mapping
> device resources into userspace/VMs, and there are certainly a lot of
> commonality with both PCI, platform, and potentially other devices for
> that to make sense.
> 
> 
> > On the other hand, there are a lot of commonalities,
> > we just have to make sure we don't try to force the code into one model
> > that doesn't really work just to make it look more like PCI VFIO.
> > 
> 
> But given that we now have code for platform device passthrough that
> works in both QEMU and the kernel side and is actually useful for
> people, is there a clear technical advantage to go back and rework thaat
> at this point?
> 
> Don't get me wrong, I like the idea of having a single driver bound to a
> platform device, and then that's it, but it just feels like that
> discussion doesn't necessarily belong in the context of a patch that
> 'just' seeks to add reset functionality for a specific device for VFIO?

Ah, this is for qemu/kvm? If there is already upstream qemu code that
finds it easier to use this approach, it's certainly more logical to
deal with it in my first approach than the second.

I was thinking of ODP as the primary user, and that wouldn't need
the interface to be consistent with PCI as much, because the code
is inherently device specific there anyway.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 27/32] nvdimm: support DSM_CMD_IMPLEMENTED function

2015-10-15 Thread Stefan Hajnoczi

On Wed, Oct 14, 2015 at 10:50:40PM +0800, Xiao Guangrong wrote:
> On 10/14/2015 05:40 PM, Stefan Hajnoczi wrote:
> >On Sun, Oct 11, 2015 at 11:52:59AM +0800, Xiao Guangrong wrote:
> >>+out = (dsm_out *)in;
> >>+
> >>+revision = in->arg1;
> >>+function = in->arg2;
> >>+handle = in->handle;
> >>+le32_to_cpus();
> >>+le32_to_cpus();
> >>+le32_to_cpus();
> >>+
> >>+nvdebug("UUID " UUID_FMT ".\n", in->arg0[0], in->arg0[1], in->arg0[2],
> >>+in->arg0[3], in->arg0[4], in->arg0[5], in->arg0[6],
> >>+in->arg0[7], in->arg0[8], in->arg0[9], in->arg0[10],
> >>+in->arg0[11], in->arg0[12], in->arg0[13], in->arg0[14],
> >>+in->arg0[15]);
> >>+nvdebug("Revision %#x Function %#x Handler %#x.\n", revision, function,
> >>+handle);
> >>+
> >>+if (revision != DSM_REVISION) {
> >>+nvdebug("Revision %#x is not supported, expect %#x.\n",
> >>+revision, DSM_REVISION);
> >>+goto exit;
> >>+}
> >>+
> >>+if (!handle) {
> >>+if (!dsm_is_root_uuid(in->arg0)) {
> >
> >Please don't dereference 'in' or pass it to other functions.  Avoid race
> >conditions with guest vcpus by coping in the entire dsm_in struct.
> >
> >This is like a system call - the kernel cannot trust userspace memory
> >and must copy in before accessing data.  The same rules apply.
> >
> 
> It's little different for QEMU:
> - the memory address is always valid to QEMU, it's not always true for Kernel
>   due to context-switch
> 
> - we have checked the header before use it's data, for example, when we get
>   data from GET_NAMESPACE_DATA, we have got the @offset and @length from the
>   memory, then copy memory based on these values, that means the userspace
>   has no chance to cause buffer overflow by increasing these values at 
> runtime.
> 
>   The scenario for our case is simple but Kernel is difficult to do
>   check_all_before_use as many paths may be involved.
> 
> - guest changes some data is okay, the worst case is that the label data is
>   corrupted. This is caused by guest itself. Kernel also supports this kind
>   of behaviour, e,g. network TX zero copy, the userspace page is being
>   transferred while userspace can still access it.
> 
> - it's 4K size on x86, full copy wastes CPU time too much.

This isn't performance-critical code and I don't want to review it
keeping the race conditions in mind the whole time.  Also, if the code
is modified in the future, the chance of introducing a race is high.

I see this as premature optimization, please just copy in data.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] KVM: x86: MMU: Remove mapping_level_dirty_bitmap()

2015-10-15 Thread Paolo Bonzini



On 15/10/2015 12:43, Takuya Yoshikawa wrote:
> +static inline bool memslot_invalid(struct kvm_memory_slot *slot)

Can you make this function memslot_valid_for_gpte(struct kvm_memory_slot
*slot, bool no_dirty_log), and have it return

slot && !(slot->flags & KVM_MEMSLOT_INVALID) &&
  (!no_dirty_log || !slot->dirty_bitmap)

?  If gfn_to_memslot_dirty_bitmap and mapping_level call the same
function, it helps highlighting the similarity between them.  Your
optimization loses that similarity in the name, but I think we can bring
it back somehow.

Otherwise, the patches are great.  Thanks!

Paolo


> +{
> + if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
> + return true;
> +
> + return false;
> +}
> +
>  static struct kvm_memory_slot *
>  gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
>   bool no_dirty_log)
> @@ -858,25 +866,22 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, 
> gfn_t gfn,
>   struct kvm_memory_slot *slot;
>  
>   slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> - if (!slot || slot->flags & KVM_MEMSLOT_INVALID ||
> -   (no_dirty_log && slot->dirty_bitmap))
> + if (memslot_invalid(slot) || (no_dirty_log && slot->dirty_bitmap))
>   slot = NULL;
>  
>   return slot;
>  }
>  
> -static bool mapping_level_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t 
> large_gfn)
> -{
> - return !gfn_to_memslot_dirty_bitmap(vcpu, large_gfn, true);
> -}
> -
>  static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
>bool *force_pt_level)
>  {
>   int host_level, level, max_level;
> + struct kvm_memory_slot *slot;
> +
> + slot = kvm_vcpu_gfn_to_memslot(vcpu, large_gfn);
>  
>   if (likely(!*force_pt_level))
> - *force_pt_level = mapping_level_dirty_bitmap(vcpu, large_gfn);
> + *force_pt_level = memslot_invalid(slot) || slot->dirty_bitmap;
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 27/32] nvdimm: support DSM_CMD_IMPLEMENTED function

2015-10-15 Thread Stefan Hajnoczi

On Wed, Oct 14, 2015 at 10:52:15PM +0800, Xiao Guangrong wrote:
> On 10/14/2015 05:41 PM, Stefan Hajnoczi wrote:
> >On Sun, Oct 11, 2015 at 11:52:59AM +0800, Xiao Guangrong wrote:
> >>+out->len = sizeof(out->status);
> >
> >out->len is uint16_t, it needs cpu_to_le16().  There may be other
> >instances in this patch series.
> >
> 
> out->len is internally used only which is invisible to guest OS, i,e,
> we write this value and read this value by ourself. I think it is
> okay.

'out' points to guest memory.  Guest memory is untrusted so QEMU cannot
stash values there - an evil guest could modify them.

Please put the len variable on the QEMU stack or heap where the guest
cannot access it.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Laszlo Ersek

CC'ing Jordan and Chen Fan.

On 10/15/15 09:10, Xiao Guangrong wrote:
> 
> 
> On 10/15/2015 02:58 PM, Janusz wrote:
>> W dniu 15.10.2015 o 08:41, Xiao Guangrong pisze:
>>>
>>>
>>> On 10/15/2015 02:19 PM, Janusz wrote:
 W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:
>
>
>
> Well, the bug may be not in KVM. When this bug happened, i saw OVMF
> only checked 1 CPU out, there is the log from OVMF's debug input:
>
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCD
> Flushing GCDs
> Detect CPU count: 1
>
> So that the startup code has been freed however the APs are still
> running,
> i think that why we saw the vCPUs executed on unexpected address.
>
> After digging into OVMF's code, i noticed that BSP CPU waits for APs
> for a fixed timer period, however, KVM recent changes require zap all
> mappings if CR0.CD is changed, that means the APs need more time to
> startup.
>
> After following changes to OVMF, the bug is completely gone on my
> side:
>
> --- a/UefiCpuPkg/CpuDxe/ApStartup.c
> +++ b/UefiCpuPkg/CpuDxe/ApStartup.c
> @@ -454,7 +454,9 @@ StartApsStackless (
>  //
>  // Wait 100 milliseconds for APs to arrive at the ApEntryPoint
> routine
>  //
> -  MicroSecondDelay (100 * 1000);
> +  MicroSecondDelay (10 * 100 * 1000);
>
>  return EFI_SUCCESS;
>}
>
> Janusz, could you please check this instead? You can switch to your
> previous kernel to do this test.
>
>
 Ok, now first time when I started VM I was able to start system
 successfully. When I turned it off and started it again, it
 restarted my
 vm at system boot couple of times. Sometimes I also get very high cpu
 usage for no reason. Also, I get less fps in GTA 5 than in kernel
 4.1, I
 get something like 30-55, but on 4.1 I get all the time 60 fps. This is
 my new log: https://bpaste.net/show/61a122ad7fe5

>>>
>>> Just confirm: the Qemu internal error did not appear any more, right?
>> Yes, when I reverted your first patch, switched to -vga std from -vga
>> none and didn't passthrough my GPU (case when I got this internal
>> error), vm started without problem. I even didn't get any VM restarts
>> like with passthrough
>>
> 
> Wow, it seems we have fixed the QEMU internal error now. :)
> 
> Recurrently, Paolo has reverted some MTRR patches, was your test
> based on these reverted patches?
> 
> The GPU passthrough issue may be related to vfio (not sure), Alex, do
> you have any idea?
> 
> Laszlo, could you please check the root case is reasonable and fix it in
> OVMF if it's right?

The code that you have found is in edk2's EFI_MP_SERVICES_PROTOCOL
implementation -- more closely, its initial CPU counter code --, from
edk2 git commit 533263ee5a7f. It is not specific to OVMF -- it is
generic edk2 code for Intel processors. (I'm CC'ing Jordan and Chen Fan
because they authored the patch in question.)

If VCPUs need more time to rendezvous than written in the code, on
recent KVM, then I think we should introduce a new FixedPCD in
UefiCpuPkg (practically: a compile time constant) for the timeout. Which
is not hard to do.

However, we'll need two things:
- an idea about the concrete rendezvous timeout to set, from OvmfPkg

- a *detailed* explanation / elaboration on your words:

  "KVM recent changes require zap all mappings if CR0.CD is changed,
  that means the APs need more time to startup"

  Preferably with references to Linux kernel commits and the Intel SDM,
  so that n00bs like me can get a fleeting idea. Do you mean that with
  caching disabled, the APs execute their rendezvous code (from memory)
  more slowly?

> BTW, OVMF handles #UD with no trace - nothing is killed, and no call trace
> in the debug input...

There *is* a trace (of any unexpected exception -- at least for the
BSP), but unfortunately its location is not intuitive.

The exception handler that is built into OVMF
("UefiCpuPkg/Library/CpuExceptionHandlerLib") is again generic edk2
code, and it prints the trace directly to the serial port, regardless of
the fact that OVMF's DebugLib instance logs explicit DEBUGs to the QEMU
debug port. (The latter can be directed to the serial port as well, if
you build OVMF with -D DEBUG_ON_SERIAL_PORT, but this is not relevant here.)

If you reproduce the issue while looking at the (virtual) serial port of
the guest, I trust you will get a register dump.

Thanks!
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Arnd Bergmann

On Thursday 15 October 2015 14:12:28 Christoffer Dall wrote:
> > 
> > enum vfio_platform_op {
> >   VFIO_PLATFORM_BIND,
> >   VFIO_PLATFORM_UNBIND,
> >   VFIO_PLATFORM_RESET,
> > };
> > 
> > struct platform_driver {
> > int (*probe)(struct platform_device *);
> > int (*remove)(struct platform_device *);
> >   ...
> >   int (*vfio_manage)(struct platform_device *, enum vfio_platform_op);
> > struct device_driver driver;
> > };
> > 
> > This would integrate much more closely into the platform driver framework,
> > just like the regular vfio driver integrates into the PCI framework.
> > Unlike PCI however, you can't just use the generic driver framework to
> > unbind the driver, because you still need device specific code.
> > 
> Thanks for these suggestions, really helpful.
> 
> What I don't understand in the latter example is how VFIO knows which
> struct platform_driver to interact with?

This would assume that the driver remains bound to the device, so VFIO
gets a pointer to the device from somewhere (as it does today) and then
follows the dev->driver pointer to get to the platform_driver.
 
> Also, just so I'm sure I understand correctly, VFIO_PLATFORM_UNBIND is
> then called by VFIO before the VFIO driver unbinds from the device
> (unbinding the platform driver from the device being a completely
> separate thing)?

This is where we'd need a little more changes for this approach. Instead
of unbinding the device from its driver, the idea would be that the
driver remains bound as far as the driver model is concerned, but
it would be in a quiescent state where no other subsystem interacts with
it (i.e. it gets unregistered from networking core or whichever it uses).

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Eric Auger

Hi Arnd,
On 10/15/2015 03:59 PM, Arnd Bergmann wrote:
> On Thursday 15 October 2015 14:12:28 Christoffer Dall wrote:
>>>
>>> enum vfio_platform_op {
>>>   VFIO_PLATFORM_BIND,
>>>   VFIO_PLATFORM_UNBIND,
>>>   VFIO_PLATFORM_RESET,
>>> };
>>>
>>> struct platform_driver {
>>> int (*probe)(struct platform_device *);
>>> int (*remove)(struct platform_device *);
>>>   ...
>>>   int (*vfio_manage)(struct platform_device *, enum vfio_platform_op);
>>> struct device_driver driver;
>>> };
>>>
>>> This would integrate much more closely into the platform driver framework,
>>> just like the regular vfio driver integrates into the PCI framework.
>>> Unlike PCI however, you can't just use the generic driver framework to
>>> unbind the driver, because you still need device specific code.
>>>
>> Thanks for these suggestions, really helpful.
>>
>> What I don't understand in the latter example is how VFIO knows which
>> struct platform_driver to interact with?
> 
> This would assume that the driver remains bound to the device, so VFIO
> gets a pointer to the device from somewhere (as it does today) and then
> follows the dev->driver pointer to get to the platform_driver.

>  
>> Also, just so I'm sure I understand correctly, VFIO_PLATFORM_UNBIND is
>> then called by VFIO before the VFIO driver unbinds from the device
>> (unbinding the platform driver from the device being a completely
>> separate thing)?
> 
> This is where we'd need a little more changes for this approach. Instead
> of unbinding the device from its driver, the idea would be that the
> driver remains bound as far as the driver model is concerned, but
> it would be in a quiescent state where no other subsystem interacts with
> it (i.e. it gets unregistered from networking core or whichever it uses).

Currently we use the same mechanism as for PCI, ie. unbind the native
driver and then bind VFIO platform driver in its place. Don't you think
changing this may be a pain for user-space tools that are designed to
work that way for PCI?

My personal preference would be to start with your first proposal since
it looks (to me) less complex and "unknown" that the 2d approach.

Let's wait for Alex opinion too...

Thanks again

Eric




> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Arnd Bergmann

On Thursday 15 October 2015 16:46:09 Eric Auger wrote:
> > 
> > This is where we'd need a little more changes for this approach. Instead
> > of unbinding the device from its driver, the idea would be that the
> > driver remains bound as far as the driver model is concerned, but
> > it would be in a quiescent state where no other subsystem interacts with
> > it (i.e. it gets unregistered from networking core or whichever it uses).
> 
> Currently we use the same mechanism as for PCI, ie. unbind the native
> driver and then bind VFIO platform driver in its place. Don't you think
> changing this may be a pain for user-space tools that are designed to
> work that way for PCI?
>
> My personal preference would be to start with your first proposal since
> it looks (to me) less complex and "unknown" that the 2d approach.

We certainly can't easily change from one approach to the other without
breaking user expectations, so the decision needs to be made carefully.

The main observation here is that platform devices are unlike PCI in this
regard because they need extra per-device code. I have argued in the
past that we should not reuse the "VFIO" name here because it's actually
something else.  On the other hand, there are a lot of commonalities,
we just have to make sure we don't try to force the code into one model
that doesn't really work just to make it look more like PCI VFIO.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Arnd Bergmann

On Thursday 15 October 2015 16:20:46 Eric Auger wrote:
> On 10/15/2015 02:12 PM, Christoffer Dall wrote:
> > On Thu, Oct 15, 2015 at 01:21:55PM +0200, Arnd Bergmann wrote:
> >> On Thursday 15 October 2015 10:08:02 Eric Auger wrote:
> >>> On 10/14/2015 05:38 PM, Arnd Bergmann wrote:
>  On Wednesday 14 October 2015 15:33:12 Eric Auger wrote:
> >> A possible solution could be something inside the xgbe driver like
> >>
> >>
> >> static void xgbe_init_module(void)
> >> {
> >>int ret = 0;
> >>
> >>if (IS_ENABLED(CONFIG_AMD_XGBE_ETHERNET)
> >>ret = platform_driver_register(_driver);
> >>if (ret)
> >>return ret;
> >>
> >>if (IS_ENABLED(CONFIG_VFIO_PLATFORM))
> >>ret = vfio_platform_register_reset(_of_match, 
> >> xgbe_platform_reset);
> >>
> >>return ret; 
> >> }
> >>
> >> This way you have exactly one driver module that gets loaded for the
> >> device and you can use it either with the platform_driver or through
> >> vfio.
> If I understand it correctly you still need 2 loaded modules (VFIO
> driver & XGBE driver which implements the reset function) or am I
> missing something?

That is correct, yes.
 
> I had a similar mechanism of registration in my PATCH v1 but I did the
> registration from the reset module itself instead of in the native
> driver, as you suggest here.

Right. The main difference is that you don't have two modules fighting
over the same device with the approach here.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Christoffer Dall

On Thu, Oct 15, 2015 at 04:55:13PM +0200, Arnd Bergmann wrote:
> On Thursday 15 October 2015 16:46:09 Eric Auger wrote:
> > > 
> > > This is where we'd need a little more changes for this approach. Instead
> > > of unbinding the device from its driver, the idea would be that the
> > > driver remains bound as far as the driver model is concerned, but
> > > it would be in a quiescent state where no other subsystem interacts with
> > > it (i.e. it gets unregistered from networking core or whichever it uses).
> > 
> > Currently we use the same mechanism as for PCI, ie. unbind the native
> > driver and then bind VFIO platform driver in its place. Don't you think
> > changing this may be a pain for user-space tools that are designed to
> > work that way for PCI?
> >
> > My personal preference would be to start with your first proposal since
> > it looks (to me) less complex and "unknown" that the 2d approach.
> 
> We certainly can't easily change from one approach to the other without
> breaking user expectations, so the decision needs to be made carefully.
> 
> The main observation here is that platform devices are unlike PCI in this
> regard because they need extra per-device code. I have argued in the
> past that we should not reuse the "VFIO" name here because it's actually
> something else.  

I've adjusted to consider VFIO a general purpose framework for mapping
device resources into userspace/VMs, and there are certainly a lot of
commonality with both PCI, platform, and potentially other devices for
that to make sense.


> On the other hand, there are a lot of commonalities,
> we just have to make sure we don't try to force the code into one model
> that doesn't really work just to make it look more like PCI VFIO.
> 

But given that we now have code for platform device passthrough that
works in both QEMU and the kernel side and is actually useful for
people, is there a clear technical advantage to go back and rework thaat
at this point?

Don't get me wrong, I like the idea of having a single driver bound to a
platform device, and then that's it, but it just feels like that
discussion doesn't necessarily belong in the context of a patch that
'just' seeks to add reset functionality for a specific device for VFIO?

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: VMX: enable LBR virtualization

2015-10-15 Thread Paolo Bonzini



On 15/10/2015 15:51, Jian Zhou wrote:
> 
> 
> On 2015/10/14 19:30, Paolo Bonzini wrote:
>>
>>
>> On 14/10/2015 13:26, Jian Zhou wrote:
>>> On 12/10/2015 20:44, Paolo Bonzini wrote:
 In addition, the MSR numbers may differ between the guest and the host,
 because it is possible to emulate e.g. a Core CPU on a Core 2 CPU. 
 So I
 recommend against using the atomic switch mechanism for the from/to
 MSRs.
>>>
>>>The vLBR feature depends on vPMU, and to enable vPMU, it needs to
>>>specify the "cpu mode" in the guest XML as host-passthrough. I think
>>>the MSR numbers between the guest and the host are the same in this
>>>senario.
>>
>> Does it depend on vPMU _for Linux guests_ or in general?  My impression
>> is that LBR can be used by the guest independent of the PMU.
> 
>   I think only for Linux guests.
> 
>   I googled how to enable LBR on other guests(except Linux guests),
>   e.g. Windows, and got no developer manuals about it.
> 
>   Here is an article about it:
>   http://www.codeproject.com/Articles/517466/Last-branch-records-
>   and-branch-tracing
>   it says:
>   "bit 8 of DR7 represents bit 0 of DebugCtl. This is the LBR bit."

Don't worry about the operating system in the guest: you are just
emulating a processor feature, you do not care about anything except
what is written in the Intel SDM.

You can use kvm-unit-tests
(https://git.kernel.org/cgit/virt/kvm/kvm-unit-tests.git/) to write a
test for your feature.  There are existing tests for debugging features.

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Eric Auger

Hi Arnd,
On 10/15/2015 02:12 PM, Christoffer Dall wrote:
> On Thu, Oct 15, 2015 at 01:21:55PM +0200, Arnd Bergmann wrote:
>> On Thursday 15 October 2015 10:08:02 Eric Auger wrote:
>>> Hi Arnd,
>>> On 10/14/2015 05:38 PM, Arnd Bergmann wrote:
 On Wednesday 14 October 2015 15:33:12 Eric Auger wrote:
> --- a/drivers/vfio/platform/vfio_platform_common.c
> +++ b/drivers/vfio/platform/vfio_platform_common.c
> @@ -31,6 +31,11 @@ static const struct vfio_platform_reset_combo 
> reset_lookup_table[] = {
> .reset_function_name = "vfio_platform_calxedaxgmac_reset",
> .module_name = "vfio-platform-calxedaxgmac",
> },
> +   {
> +   .compat = "amd,xgbe-seattle-v1a",
> +   .reset_function_name = "vfio_platform_amdxgbe_reset",
> +   .module_name = "vfio-platform-amdxgbe",
> +   },
>  };
>  
>  static void vfio_platform_get_reset(struct vfio_platform_device *vdev,
>

 This is causing build errors for me when CONFIG_MODULES is disabled.
>>> Sorry about that and thanks for reporting the issue

 Could this please be restructured so vfio_platform_get_reset does
 not attempt to call __symbol_get() but instead has the drivers
 register themselves properly to a subsystem?
>>> OK
>>>
>>> Could you elaborate about "has the drivers register themselves properly
>>> to a subsystem".
>>>
>>> My first proposal when coping with this problematic of being able to add
>>> reset plugins to the vfio-platform driver was to create new drivers per
>>> device requiring reset. but this was considered painful for end-users,
>>> who needed to be aware of the right driver to bind - and I think that
>>> makes sense - (https://lkml.org/lkml/2015/4/17/568) .
>>
>> Having multiple drivers indeed sucks, but your current approach isn't
>> that much better, as you still have two modules that are used to driver
>> the same hardware.
>>
>> I would expect that the same driver that is used for the normal
>> operation and that it calls a function to register itself to vfio
>> by passing a structure with the device and reset function pointer.
>>
>>> A naive question I dare to ask, wouldn't it be acceptable to make
>>> vfio_platform depend on CONFIG_MODULES? Don't we disable modules for
>>> security purpose? In that context would we use VFIO?
>>
>> I think a lot of embedded systems turn off modules to save a little
>> memory, speed up boot time and simplify their user space.
>>
>> Aside from that, the current method is highly unusual and looks a bit
>> fragile to me, as you are relying on internals of the module loader
>> code. It's also a layering violation as the generic code needs to be
>> patched for each device specific module that is added, and we try
>> to avoid that.
Many thanks for taking the time to write this down
>>
>> A possible solution could be something inside the xgbe driver like
>>
>>
>> static void xgbe_init_module(void)
>> {
>>  int ret = 0;
>>
>>  if (IS_ENABLED(CONFIG_AMD_XGBE_ETHERNET)
>>  ret = platform_driver_register(_driver);
>>  if (ret)
>>  return ret;
>>
>>  if (IS_ENABLED(CONFIG_VFIO_PLATFORM))
>>  ret = vfio_platform_register_reset(_of_match, 
>> xgbe_platform_reset);
>>
>>  return ret; 
>> }
>>
>> This way you have exactly one driver module that gets loaded for the
>> device and you can use it either with the platform_driver or through
>> vfio.
If I understand it correctly you still need 2 loaded modules (VFIO
driver & XGBE driver which implements the reset function) or am I
missing something?

I had a similar mechanism of registration in my PATCH v1 but I did the
registration from the reset module itself instead of in the native
driver, as you suggest here.

Best Regards

Eric

>>
>> A nicer way that would be a little more work would be to integrate
>> the reset infrastructure into 'struct platform_driver' framework,
>> by adding another callback to the it for doing the interaction with
>> vfio, something like
>>
>> enum vfio_platform_op {
>>  VFIO_PLATFORM_BIND,
>>  VFIO_PLATFORM_UNBIND,
>>  VFIO_PLATFORM_RESET,
>> };
>>
>> struct platform_driver {
>> int (*probe)(struct platform_device *);
>> int (*remove)(struct platform_device *);
>>  ...
>>  int (*vfio_manage)(struct platform_device *, enum vfio_platform_op);
>> struct device_driver driver;
>> };
>>
>> This would integrate much more closely into the platform driver framework,
>> just like the regular vfio driver integrates into the PCI framework.
>> Unlike PCI however, you can't just use the generic driver framework to
>> unbind the driver, because you still need device specific code.
>>
> Thanks for these suggestions, really helpful.
> 
> What I don't understand in the latter example is how VFIO knows which
> struct platform_driver to interact with?
> 
> Also, just so I'm sure I understand

Re: [PATCH v3 27/32] nvdimm: support DSM_CMD_IMPLEMENTED function

2015-10-15 Thread Xiao Guangrong




On 10/15/2015 11:07 PM, Stefan Hajnoczi wrote:

On Wed, Oct 14, 2015 at 10:50:40PM +0800, Xiao Guangrong wrote:

On 10/14/2015 05:40 PM, Stefan Hajnoczi wrote:

On Sun, Oct 11, 2015 at 11:52:59AM +0800, Xiao Guangrong wrote:

+out = (dsm_out *)in;
+
+revision = in->arg1;
+function = in->arg2;
+handle = in->handle;
+le32_to_cpus();
+le32_to_cpus();
+le32_to_cpus();
+
+nvdebug("UUID " UUID_FMT ".\n", in->arg0[0], in->arg0[1], in->arg0[2],
+in->arg0[3], in->arg0[4], in->arg0[5], in->arg0[6],
+in->arg0[7], in->arg0[8], in->arg0[9], in->arg0[10],
+in->arg0[11], in->arg0[12], in->arg0[13], in->arg0[14],
+in->arg0[15]);
+nvdebug("Revision %#x Function %#x Handler %#x.\n", revision, function,
+handle);
+
+if (revision != DSM_REVISION) {
+nvdebug("Revision %#x is not supported, expect %#x.\n",
+revision, DSM_REVISION);
+goto exit;
+}
+
+if (!handle) {
+if (!dsm_is_root_uuid(in->arg0)) {


Please don't dereference 'in' or pass it to other functions.  Avoid race
conditions with guest vcpus by coping in the entire dsm_in struct.

This is like a system call - the kernel cannot trust userspace memory
and must copy in before accessing data.  The same rules apply.



It's little different for QEMU:
- the memory address is always valid to QEMU, it's not always true for Kernel
   due to context-switch

- we have checked the header before use it's data, for example, when we get
   data from GET_NAMESPACE_DATA, we have got the @offset and @length from the
   memory, then copy memory based on these values, that means the userspace
   has no chance to cause buffer overflow by increasing these values at runtime.

   The scenario for our case is simple but Kernel is difficult to do
   check_all_before_use as many paths may be involved.

- guest changes some data is okay, the worst case is that the label data is
   corrupted. This is caused by guest itself. Kernel also supports this kind
   of behaviour, e,g. network TX zero copy, the userspace page is being
   transferred while userspace can still access it.

- it's 4K size on x86, full copy wastes CPU time too much.


This isn't performance-critical code and I don't want to review it
keeping the race conditions in mind the whole time.  Also, if the code
is modified in the future, the chance of introducing a race is high.

I see this as premature optimization, please just copy in data.



No strong opinion on it... will do it following your idea.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-10-15 Thread Wu, Feng



> -Original Message-
> From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo
> Bonzini
> Sent: Friday, October 16, 2015 2:13 AM
> To: David Matlack ; Wu, Feng 
> Cc: alex.william...@redhat.com; Joerg Roedel ; Marcelo
> Tosatti ; eric.au...@linaro.org; kvm list
> ; io...@lists.linux-foundation.org; linux-
> ker...@vger.kernel.org
> Subject: Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when
> vCPU is blocked
> 
> 
> 
> On 15/10/2015 19:39, David Matlack wrote:
> > But after spending more time reading the source code this morning I
> > found that kvm_vcpu_check_block() eventually calls into
> > vmx_sync_pir_to_irr(), which copies PIR to IRR and clears ON. And then
> > apic_find_highest_irr() detects the pending posted interrupt.
> 
> Right.  And related to this, Feng, can you check if this is still
> necessary on kvm/queue:
> 
> @@ -6518,6 +6523,20 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   kvm_vcpu_reload_apic_access_page(vcpu);
>   }
> 
> + /*
> +  * KVM_REQ_EVENT is not set when posted interrupts are set by
> +  * VT-d hardware, so we have to update RVI unconditionally.
> +  */
> + if (kvm_lapic_enabled(vcpu)) {
> + /*
> +  * Update architecture specific hints for APIC
> +  * virtual interrupt delivery.
> +  */
> + if (kvm_x86_ops->hwapic_irr_update)
> + kvm_x86_ops->hwapic_irr_update(vcpu,
> + kvm_lapic_find_highest_irr(vcpu));
> + }
> +
>   if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>   kvm_apic_accept_events(vcpu);
>   if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
> @@ -6534,13 +6553,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   kvm_x86_ops->enable_irq_window(vcpu);
> 
>   if (kvm_lapic_enabled(vcpu)) {
> - /*
> -  * Update architecture specific hints for APIC
> -  * virtual interrupt delivery.
> -  */
> - if (kvm_x86_ops->hwapic_irr_update)
> - kvm_x86_ops->hwapic_irr_update(vcpu,
> - kvm_lapic_find_highest_irr(vcpu));
>   update_cr8_intercept(vcpu);
>   kvm_lapic_sync_to_vapic(vcpu);
>   }
> 

I think the above code is needed, before the place where 'KVM_REQ_EVENT'
got checked in vcpu_enter_guest(), VT-d hardware can issue notification
event at any time. Consider the following scenario:

vcpu_run()
{
..  

for(;;) {
point #1
vcpu_enter_guest()
}   

point #2
}

For example, if we receive notification events issued by VT-d hardware at
point #1 and point#2, then enter vcpu_enter_guest() with 'KVM_REQ_EVENT'
not set, the interrupts cannot be delivered to guest during _this_ VM-Entry.

The point is that VT-d hardware can issue notification event at any time,
but it cannot set 'KVM_REQ_EVENT' like software does.

Maybe one thing we can do is only executing the following code when
vt-d pi is enabled,

 +  /*
 +   * KVM_REQ_EVENT is not set when posted interrupts are set by
 +   * VT-d hardware, so we have to update RVI unconditionally.
 +   */
 +  if (kvm_lapic_enabled(vcpu)) {
 +  /*
 +   * Update architecture specific hints for APIC
 +   * virtual interrupt delivery.
 +   */
 +  if (kvm_x86_ops->hwapic_irr_update)
 +  kvm_x86_ops->hwapic_irr_update(vcpu,
 +  kvm_lapic_find_highest_irr(vcpu));
 +  }
 +

And do this inside the KVM_REQ_EVENT check when VT-d PI is not enabled.

Thanks,
Feng

> 
> It may be obsolete now that we have the patch from Radim to set
> KVM_REQ_EVENT
> in vmx_sync_pir_to_irr
> (http://permalink.gmane.org/gmane.linux.kernel/2057138).
> 
> Thanks,
> 
> Paolo

linux-next: manual merge of the kvm-arm tree with the kvm tree

2015-10-15 Thread Stephen Rothwell

Hi all,

Today's linux-next merge of the kvm-arm tree got a conflict in:

  arch/x86/include/asm/kvm_host.h

between commit:

  d84f1e0755ba ("KVM: make kvm_set_msi_irq() public")

from the kvm tree and commits:

  8feb4a04dc75 ("KVM: Define a new interface kvm_intr_is_single_vcpu()")
  8fc52eb9cee4 ("KVM: Add kvm_arch_vcpu_{un}blocking callbacks")

from the kvm-arm tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au

diff --cc arch/x86/include/asm/kvm_host.h
index 53deb2750bf6,b926137d591f..
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@@ -1256,9 -1232,7 +1256,13 @@@ int x86_set_memory_region(struct kvm *k
  bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
  bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
  
 +bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
 +   struct kvm_vcpu **dest_vcpu);
 +
 +void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
 +   struct kvm_lapic_irq *irq);
++
+ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
+ static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
+ 
  #endif /* _ASM_X86_KVM_HOST_H */
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: VMX: enable LBR virtualization

2015-10-15 Thread Jian Zhou


Does it depend on vPMU _for Linux guests_ or in general?  My impression
is that LBR can be used by the guest independent of the PMU.


   I think only for Linux guests.

   I googled how to enable LBR on other guests(except Linux guests),
   e.g. Windows, and got no developer manuals about it.

   Here is an article about it:
   http://www.codeproject.com/Articles/517466/Last-branch-records-
   and-branch-tracing
   it says:
   "bit 8 of DR7 represents bit 0 of DebugCtl. This is the LBR bit."


Don't worry about the operating system in the guest: you are just
emulating a processor feature, you do not care about anything except
what is written in the Intel SDM.

You can use kvm-unit-tests
(https://git.kernel.org/cgit/virt/kvm/kvm-unit-tests.git/) to write a
test for your feature.  There are existing tests for debugging features.


  ok.

  Regards,

  Jian







--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 27/32] nvdimm: support DSM_CMD_IMPLEMENTED function

2015-10-15 Thread Xiao Guangrong




On 10/15/2015 11:01 PM, Stefan Hajnoczi wrote:

On Wed, Oct 14, 2015 at 10:52:15PM +0800, Xiao Guangrong wrote:

On 10/14/2015 05:41 PM, Stefan Hajnoczi wrote:

On Sun, Oct 11, 2015 at 11:52:59AM +0800, Xiao Guangrong wrote:

+out->len = sizeof(out->status);


out->len is uint16_t, it needs cpu_to_le16().  There may be other
instances in this patch series.



out->len is internally used only which is invisible to guest OS, i,e,
we write this value and read this value by ourself. I think it is
okay.


'out' points to guest memory.  Guest memory is untrusted so QEMU cannot
stash values there - an evil guest could modify them.

Please put the len variable on the QEMU stack or heap where the guest
cannot access it.


okay okay.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RESEND PATCH] kvm: Allow the Hyper-V vendor ID to be specified

2015-10-15 Thread Alex Williamson

According to Microsoft documentation, the signature in the standard
hypervisor CPUID leaf at 0x4000 identifies the Vendor ID and is
for reporting and diagnostic purposes only.  We can therefore allow
the user to change it to whatever they want, within the 12 character
limit.  Add a new hyperv-vendor-id option to the -cpu flag to allow
for this, ex:

 -cpu host,hv_time,hv_vendor_id=KeenlyKVM

Link: http://msdn.microsoft.com/library/windows/hardware/hh975392
Signed-off-by: Alex Williamson 
---

Cc'ing get_maintainers this time.  Any takers?  Thanks,
Alex

 target-i386/cpu-qom.h |1 +
 target-i386/cpu.c |1 +
 target-i386/kvm.c |   14 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
index c35b624..6c1eaaa 100644
--- a/target-i386/cpu-qom.h
+++ b/target-i386/cpu-qom.h
@@ -88,6 +88,7 @@ typedef struct X86CPU {
 bool hyperv_vapic;
 bool hyperv_relaxed_timing;
 int hyperv_spinlock_attempts;
+char *hyperv_vendor_id;
 bool hyperv_time;
 bool hyperv_crash;
 bool check_cpuid;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 05d7f26..71df546 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -3146,6 +3146,7 @@ static Property x86_cpu_properties[] = {
 DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
 DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
 DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
+DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
 DEFINE_PROP_END_OF_LIST()
 };
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 80d1a7e..5e3ab22 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -490,7 +490,19 @@ int kvm_arch_init_vcpu(CPUState *cs)
 if (hyperv_enabled(cpu)) {
 c = _data.entries[cpuid_i++];
 c->function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
-memcpy(signature, "Microsoft Hv", 12);
+if (!cpu->hyperv_vendor_id) {
+memcpy(signature, "Microsoft Hv", 12);
+} else {
+size_t len = strlen(cpu->hyperv_vendor_id);
+
+if (len > 12) {
+fprintf(stderr,
+"hyperv-vendor-id too long, limited to 12 charaters");
+abort();
+}
+memset(signature, 0, 12);
+memcpy(signature, cpu->hyperv_vendor_id, len);
+}
 c->eax = HYPERV_CPUID_MIN;
 c->ebx = signature[0];
 c->ecx = signature[1];

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Network hangs when communicating with host

2015-10-15 Thread Dmitry Vyukov

Hello,

I am trying to run a program in lkvm sandbox so that it communicates
with a program on host. I run lkvm as:

./lkvm sandbox --disk sandbox-test --mem=2048 --cpus=4 --kernel
/arch/x86/boot/bzImage --network mode=user -- /my_prog

/my_prog then connects to a program on host over a tcp socket.
I see that host receives some data, sends some data back, but then
my_prog hangs on network read.

To localize this I wrote 2 programs (attached). ping is run on host
and pong is run from lkvm sandbox. They successfully establish tcp
connection, but after some iterations both hang on read.

Networking code in Go runtime is there for more than 3 years, widely
used in production and does not have any known bugs. However, it uses
epoll edge-triggered readiness notifications that known to be tricky.
Is it possible that lkvm contains some networking bug? Can it be
related to the data races in lkvm I reported earlier today?

I am on commit 3695adeb227813d96d9c41850703fb53a23845eb.

Thank you
package main

import (
	"log"
	"net"
)

func main() {
	c, err := net.Dial("tcp", "192.168.33.1:39921")
	if err != nil {
		log.Fatalf("failed to dial: %v", err)
	}
	log.Printf("connected")
	for {
		var buf [1]byte
		n, err := c.Write(buf[:])
		if err != nil || n != 1 {
			log.Fatalf("failed to write: %v", err)
		}
		log.Printf("write")
		n, err = c.Read(buf[:])
		if err != nil || n != 1 {
			log.Fatalf("failed to read: %v", err)
		}
		log.Printf("read")
	}
}
package main

import (
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:39921")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	log.Printf("listening")
	c, err := ln.Accept()
	if err != nil {
		log.Fatalf("failed to accept: %v", err)
	}
	log.Printf("connected")
	for {
		var buf [1]byte
		n, err := c.Read(buf[:])
		if err != nil || n != 1 {
			log.Fatalf("failed to read: %v", err)
		}
		log.Printf("read")
		n, err = c.Write(buf[:])
		if err != nil || n != 1 {
			log.Fatalf("failed to write: %v", err)
		}
		log.Printf("write")
	}
}

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Alex Williamson

On Thu, 2015-10-15 at 21:42 +0200, Christoffer Dall wrote:
> On Thu, Oct 15, 2015 at 10:53:17AM -0600, Alex Williamson wrote:
> > On Thu, 2015-10-15 at 16:46 +0200, Eric Auger wrote:
> > > Hi Arnd,
> > > On 10/15/2015 03:59 PM, Arnd Bergmann wrote:
> > > > On Thursday 15 October 2015 14:12:28 Christoffer Dall wrote:
> > > >>>
> > > >>> enum vfio_platform_op {
> > > >>>   VFIO_PLATFORM_BIND,
> > > >>>   VFIO_PLATFORM_UNBIND,
> > > >>>   VFIO_PLATFORM_RESET,
> > > >>> };
> > > >>>
> > > >>> struct platform_driver {
> > > >>> int (*probe)(struct platform_device *);
> > > >>> int (*remove)(struct platform_device *);
> > > >>>   ...
> > > >>>   int (*vfio_manage)(struct platform_device *, enum 
> > > >>> vfio_platform_op);
> > > >>> struct device_driver driver;
> > > >>> };
> > > >>>
> > > >>> This would integrate much more closely into the platform driver 
> > > >>> framework,
> > > >>> just like the regular vfio driver integrates into the PCI framework.
> > > >>> Unlike PCI however, you can't just use the generic driver framework to
> > > >>> unbind the driver, because you still need device specific code.
> > > >>>
> > > >> Thanks for these suggestions, really helpful.
> > > >>
> > > >> What I don't understand in the latter example is how VFIO knows which
> > > >> struct platform_driver to interact with?
> > > > 
> > > > This would assume that the driver remains bound to the device, so VFIO
> > > > gets a pointer to the device from somewhere (as it does today) and then
> > > > follows the dev->driver pointer to get to the platform_driver.
> > 
> > The complexity of managing a bi-modal driver seems like far more than a
> > little bit of code duplication in a device specific reset module and
> > extends into how userspace makes devices available through vfio, so I
> > think it's too late for that discussion.
> >   
> 
> I have had extremely limited exposure to the implementation details of
> the drivers for devices relevant for VFIO platform, so apologies for
> asking stupid questions.
> 
> I'm sure that your point is valid, I just fully understand how the
> complexities of a bi-modal driver arise?
> 
> Is it simply that the reset function in a particular device driver may
> not be self-contained so therefore the whole driver would need to be
> refactored to be able to do a reset for the purpose of VFIO?

Yes, I would expect that reset function in a driver is not typically
self contained, probably referencing driver specific data structures for
register offsets, relying on various mappings that are expected to be
created from the driver probe() function, etc.  It also creates a
strange dependency on the host driver, how is the user to know they need
the native host driver loaded for full functionality in a device
assignment scenario?  Are we going to need to do a module_request() of
the host driver in vfio platform anyway?  What if there are multiple
devices and the host driver claims the others when loaded?  In the case
of PCI and SR-IOV virtual functions, I often blacklist the host VF
driver because I shouldn't need it when I only intend to use the device
in guests.  Not to mention that we'd have to drop a little bit of vfio
knowledge into each host driver that we intend to enlighten like this,
and how do we resolve whether the host driver, potentially compiled from
a separate source tree has this support.

I really don't see the layering violation in having a set of reset
functions and some lookup mechanism to pick the correct one.  The vfio
platform driver is a meta driver and sometimes we need to enlighten it a
little bit more about the device it's operating on.  For PCI we have all
sorts of special functionality for reset, but we have a standard to work
with there, so we may need to choose between a bus reset, PM reset, AF
FLR, PCIe FLR, or device specific reset, but it's all buried in the PCI
core code; where device specific resets are the exception on PCI, they
are the norm on platform.

> > > >> Also, just so I'm sure I understand correctly, VFIO_PLATFORM_UNBIND is
> > > >> then called by VFIO before the VFIO driver unbinds from the device
> > > >> (unbinding the platform driver from the device being a completely
> > > >> separate thing)?
> > > > 
> > > > This is where we'd need a little more changes for this approach. Instead
> > > > of unbinding the device from its driver, the idea would be that the
> > > > driver remains bound as far as the driver model is concerned, but
> > > > it would be in a quiescent state where no other subsystem interacts with
> > > > it (i.e. it gets unregistered from networking core or whichever it 
> > > > uses).
> > > 
> > > Currently we use the same mechanism as for PCI, ie. unbind the native
> > > driver and then bind VFIO platform driver in its place. Don't you think
> > > changing this may be a pain for user-space tools that are designed to
> > > work that way for PCI?
> > > 
> > > My personal preference would be to start

[RFC PATCH] vfio/type1: Do not support IOMMUs that allow bypass

2015-10-15 Thread Alex Williamson

We can only provide isolation if DMA is forced through the IOMMU
aperture.  Don't allow type1 to be used if this is not the case.

Signed-off-by: Alex Williamson 
---

Eric, I see a number of IOMMU drivers enable this, do the ones you
care about for ARM set geometry.force_aperture?  Thanks,

Alex

 drivers/vfio/vfio_iommu_type1.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 57d8c37..6afa9d4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -728,6 +728,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
struct vfio_group *group, *g;
struct vfio_domain *domain, *d;
struct bus_type *bus = NULL;
+   struct iommu_domain_geometry geometry;
int ret;
 
mutex_lock(>lock);
@@ -762,6 +763,17 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
goto out_free;
}
 
+   /*
+* If a domain does not force DMA within the aperture, devices are not
+* isolated and type1 is not an appropriate IOMMU model.
+*/
+   ret = iommu_domain_get_attr(domain->domain,
+   DOMAIN_ATTR_GEOMETRY, );
+   if (ret || !geometry.force_aperture) {
+   ret = -EPERM;
+   goto out_domain;
+   }
+
if (iommu->nesting) {
int attr = 1;
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] vfio/pci: Use kernel VPD access functions

2015-10-15 Thread Alex Williamson

The PCI VPD capability operates on a set of window registers in PCI
config space.  Writing to the address register triggers either a read
or write, depending on the setting of the PCI_VPD_ADDR_F bit within
the address register.  The data register provides either the source
for writes or the target for reads.

This model is susceptible to being broken by concurrent access, for
which the kernel has adopted a set of access functions to serialize
these registers.  Additionally, commits like 932c435caba8 ("PCI: Add
dev_flags bit to access VPD through function 0") and 7aa6ca4d39ed
("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
that VPD registers can be shared between functions on multifunction
devices creating dependencies between otherwise independent devices.

Fortunately it's quite easy to emulate the VPD registers, simply
storing copies of the address and data registers in memory and
triggering a VPD read or write on writes to the address register.
This allows vfio users to avoid seeing spurious register changes from
accesses on other devices and enables the use of shared quirks in the
host kernel.  We can theoretically still race with access through
sysfs, but the window of opportunity is much smaller.

Signed-off-by: Alex Williamson 
Acked-by: Mark Rustad 
---

I posted this about a month ago as an RFC and it got positive feedback
as a thing we should do.  Therefore, officially proposing it for v4.4.

 drivers/vfio/pci/vfio_pci_config.c |   70 +++-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index ff75ca3..a8657ef 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -671,6 +671,73 @@ static int __init init_pci_cap_pm_perm(struct perm_bits 
*perm)
return 0;
 }
 
+static int vfio_vpd_config_write(struct vfio_pci_device *vdev, int pos,
+int count, struct perm_bits *perm,
+int offset, __le32 val)
+{
+   struct pci_dev *pdev = vdev->pdev;
+   __le16 *paddr = (__le16 *)(vdev->vconfig + pos - offset + PCI_VPD_ADDR);
+   __le32 *pdata = (__le32 *)(vdev->vconfig + pos - offset + PCI_VPD_DATA);
+   u16 addr;
+   u32 data;
+
+   /*
+* Write through to emulation.  If the write includes the upper byte
+* of PCI_VPD_ADDR, then the PCI_VPD_ADDR_F bit is written and we
+* have work to do.
+*/
+   count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+   if (count < 0 || offset > PCI_VPD_ADDR + 1 ||
+   offset + count <= PCI_VPD_ADDR + 1)
+   return count;
+
+   addr = le16_to_cpu(*paddr);
+
+   if (addr & PCI_VPD_ADDR_F) {
+   data = le32_to_cpu(*pdata);
+   if (pci_write_vpd(pdev, addr & ~PCI_VPD_ADDR_F, 4, ) != 4)
+   return count;
+   } else {
+   if (pci_read_vpd(pdev, addr, 4, ) != 4)
+   return count;
+   *pdata = cpu_to_le32(data);
+   }
+
+   /*
+* Toggle PCI_VPD_ADDR_F in the emulated PCI_VPD_ADDR register to
+* signal completion.  If an error occurs above, we assume that not
+* toggling this bit will induce a driver timeout.
+*/
+   addr ^= PCI_VPD_ADDR_F;
+   *paddr = cpu_to_le16(addr);
+
+   return count;
+}
+
+/* Permissions for Vital Product Data capability */
+static int __init init_pci_cap_vpd_perm(struct perm_bits *perm)
+{
+   if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_VPD]))
+   return -ENOMEM;
+
+   perm->writefn = vfio_vpd_config_write;
+
+   /*
+* We always virtualize the next field so we can remove
+* capabilities from the chain if we want to.
+*/
+   p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+   /*
+* Both the address and data registers are virtualized to
+* enable access through the pci_vpd_read/write functions
+*/
+   p_setw(perm, PCI_VPD_ADDR, (u16)ALL_VIRT, (u16)ALL_WRITE);
+   p_setd(perm, PCI_VPD_DATA, ALL_VIRT, ALL_WRITE);
+
+   return 0;
+}
+
 /* Permissions for PCI-X capability */
 static int __init init_pci_cap_pcix_perm(struct perm_bits *perm)
 {
@@ -790,6 +857,7 @@ void vfio_pci_uninit_perm_bits(void)
free_perm_bits(_perms[PCI_CAP_ID_BASIC]);
 
free_perm_bits(_perms[PCI_CAP_ID_PM]);
+   free_perm_bits(_perms[PCI_CAP_ID_VPD]);
free_perm_bits(_perms[PCI_CAP_ID_PCIX]);
free_perm_bits(_perms[PCI_CAP_ID_EXP]);
free_perm_bits(_perms[PCI_CAP_ID_AF]);
@@ -807,7 +875,7 @@ int __init vfio_pci_init_perm_bits(void)
 
/* Capabilities */
ret |= init_pci_cap_pm_perm(_perms[PCI_CAP_ID_PM]);
-   cap_perms[PCI_CAP_ID_VPD].writefn = vfio_raw_config_write;
+   ret |=

RE: linux-next: manual merge of the kvm-arm tree with the kvm tree

2015-10-15 Thread Wu, Feng



> -Original Message-
> From: Stephen Rothwell [mailto:s...@canb.auug.org.au]
> Sent: Friday, October 16, 2015 11:53 AM
> To: Christoffer Dall ; Marc Zyngier
> ; Marcelo Tosatti ; Gleb
> Natapov ; kvm@vger.kernel.org
> Cc: linux-n...@vger.kernel.org; linux-ker...@vger.kernel.org; Wu, Feng
> ; Paolo Bonzini ; Christian
> Borntraeger ; Cornelia Huck
> 
> Subject: linux-next: manual merge of the kvm-arm tree with the kvm tree
> 
> I fixed it up (see below) and can carry the fix as necessary (no action
> is required).

Thanks Stephen!

Thanks,
Feng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 00/20] KVM: ARM64: Add guest PMU support

2015-10-15 Thread Wei Huang



On 09/24/2015 05:31 PM, Shannon Zhao wrote:
> This patchset adds guest PMU support for KVM on ARM64. It takes
> trap-and-emulate approach. When guest wants to monitor one event, it
> will be trapped by KVM and KVM will call perf_event API to create a perf
> event and call relevant perf_event APIs to get the count value of event.
> 
> Use perf to test this patchset in guest. When using "perf list", it
> shows the list of the hardware events and hardware cache events perf
> supports. Then use "perf stat -e EVENT" to monitor some event. For
> example, use "perf stat -e cycles" to count cpu cycles and
> "perf stat -e cache-misses" to count cache misses.
> 
> Below are the outputs of "perf stat -r 5 sleep 5" when running in host
> and guest.
> 
> Host:
>  Performance counter stats for 'sleep 5' (5 runs):
> 
>   0.551428  task-clock (msec) #0.000 CPUs utilized
> ( +-  0.91% )
>  1  context-switches  #0.002 M/sec
>  0  cpu-migrations#0.000 K/sec
> 48  page-faults   #0.088 M/sec
> ( +-  1.05% )
>1150265  cycles#2.086 GHz  
> ( +-  0.92% )
>  stalled-cycles-frontend
>  stalled-cycles-backend
> 526398  instructions  #0.46  insns per cycle  
> ( +-  0.89% )
>  branches
>   9485  branch-misses #   17.201 M/sec
> ( +-  2.35% )
> 
>5.000831616 seconds time elapsed   
>( +-  0.00% )
> 
> Guest:
>  Performance counter stats for 'sleep 5' (5 runs):
> 
>   0.730868  task-clock (msec) #0.000 CPUs utilized
> ( +-  1.13% )
>  1  context-switches  #0.001 M/sec
>  0  cpu-migrations#0.000 K/sec
> 48  page-faults   #0.065 M/sec
> ( +-  0.42% )
>1642982  cycles#2.248 GHz  
> ( +-  1.04% )
>  stalled-cycles-frontend
>  stalled-cycles-backend
> 637964  instructions  #0.39  insns per cycle  
> ( +-  0.65% )
>  branches
>  10377  branch-misses #   14.198 M/sec
> ( +-  1.09% )
> 
>5.001289068 seconds time elapsed   
>( +-  0.00% )
> 

Thanks for V3. One suggestion is to run more perf stress tests, such as
"perf test". So we know the corner cases are covered as much as possible.

> This patchset can be fetched from [1] and the relevant QEMU version for
> test can be fetched from [2].
> 
> Thanks,
> Shannon
> 
> [1] https://git.linaro.org/people/shannon.zhao/linux-mainline.git  
> KVM_ARM64_PMU_v3
> [2] https://git.linaro.org/people/shannon.zhao/qemu.git  PMU_v2
> 
> Changes since v2->v3:
> * Directly use perf raw event type to create perf_event in KVM
> * Add a helper vcpu_sysreg_write
> * remove unrelated header file
> 
> Changes since v1->v2:
> * Use switch...case for registers access handler instead of adding
>   alone handler for each register
> * Try to use the sys_regs to store the register value instead of adding
>   new variables in struct kvm_pmc
> * Fix the handle of cp15 regs
> * Create a new kvm device vPMU, then userspace could choose whether to
>   create PMU
> * Fix the handle of PMU overflow interrupt
> 
> Shannon Zhao (20):
>   ARM64: Move PMU register related defines to asm/pmu.h
>   KVM: ARM64: Define PMU data structure for each vcpu
>   KVM: ARM64: Add offset defines for PMU registers
>   KVM: ARM64: Add reset and access handlers for PMCR_EL0 register
>   KVM: ARM64: Add reset and access handlers for PMSELR register
>   KVM: ARM64: Add reset and access handlers for PMCEID0 and PMCEID1
> register
>   KVM: ARM64: PMU: Add perf event map and introduce perf event creating
> function
>   KVM: ARM64: Add reset and access handlers for PMXEVTYPER register
>   KVM: ARM64: Add reset and access handlers for PMXEVCNTR register
>   KVM: ARM64: Add reset and access handlers for PMCCNTR register
>   KVM: ARM64: Add reset and access handlers for PMCNTENSET and
> PMCNTENCLR register
>   KVM: ARM64: Add reset and access handlers for PMINTENSET and
> PMINTENCLR register
>   KVM: ARM64: Add reset and access handlers for PMOVSSET and PMOVSCLR
> register
>   KVM: ARM64: Add reset and access handlers for PMUSERENR register
>   KVM: ARM64: Add reset and access handlers for PMSWINC register
>   KVM: ARM64: Add access handlers for PMEVCNTRn and PMEVTYPERn register
>   KVM: ARM64: Add PMU overflow interrupt routing
>   KVM: ARM64: Reset PMU state when resetting vcpu
>   KVM: ARM64: Free perf event of PMU when destroying vcpu
>   KVM: ARM64: Add a new kvm ARM PMU device
> 
>

Re: [PATCH v3 04/20] KVM: ARM64: Add reset and access handlers for PMCR_EL0 register

2015-10-15 Thread Wei Huang



On 09/24/2015 05:31 PM, Shannon Zhao wrote:
> Add reset handler which gets host value of PMCR_EL0 and make writable
> bits architecturally UNKNOWN. Add a common access handler for PMU
> registers which emulates writing and reading register and add emulation
> for PMCR.
> 
> Signed-off-by: Shannon Zhao 
> ---
>  arch/arm64/kvm/sys_regs.c | 81 
> +--
>  1 file changed, 79 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index b41607d..60c0842 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -446,6 +447,53 @@ static void reset_mpidr(struct kvm_vcpu *vcpu, const 
> struct sys_reg_desc *r)
>   vcpu_sys_reg(vcpu, MPIDR_EL1) = (1ULL << 31) | mpidr;
>  }
>  
> +static void vcpu_sysreg_write(struct kvm_vcpu *vcpu,
> +   const struct sys_reg_desc *r, u64 val)
> +{
> + if (!vcpu_mode_is_32bit(vcpu))
> + vcpu_sys_reg(vcpu, r->reg) = val;
> + else
> + vcpu_cp15(vcpu, r->reg) = lower_32_bits(val);
> +}
> +
> +static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
> +{
> + u64 pmcr, val;
> +
> + asm volatile("mrs %0, pmcr_el0\n" : "=r" (pmcr));
> + /* Writable bits of PMCR_EL0 (ARMV8_PMCR_MASK) is reset to UNKNOWN*/
> + val = (pmcr & ~ARMV8_PMCR_MASK) | (ARMV8_PMCR_MASK & 0xdecafbad);

Two comments:
(1) In Patch 1, ARMV8_PMCR_MASK is defined as 0x3f. According to ARMv8
spec, PMCR_EL0.LC (bit 6) is also writable. Should ARMV8_PMCR_MASK be 0x7f?
(2) According to spec the PMCR_EL0.E bit reset to 0, not UNKNOWN.


> + vcpu_sysreg_write(vcpu, r, val);
> +}
> +
> +/* PMU registers accessor. */
> +static bool access_pmu_regs(struct kvm_vcpu *vcpu,
> + const struct sys_reg_params *p,
> + const struct sys_reg_desc *r)
> +{
> + unsigned long val;
> +
> + if (p->is_write) {
> + switch (r->reg) {
> + case PMCR_EL0: {
> + /* Only update writeable bits of PMCR */
> + val = vcpu_sys_reg(vcpu, r->reg);
> + val &= ~ARMV8_PMCR_MASK;
> + val |= *vcpu_reg(vcpu, p->Rt) & ARMV8_PMCR_MASK;
> + vcpu_sys_reg(vcpu, r->reg) = val;
> + break;
> + }
> + default:
> + vcpu_sys_reg(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
> + break;
> + }
> + } else {
> + *vcpu_reg(vcpu, p->Rt) = vcpu_sys_reg(vcpu, r->reg);
> + }
> +
> + return true;
> +}
> +
>  /* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */
>  #define DBG_BCR_BVR_WCR_WVR_EL1(n)   \
>   /* DBGBVRn_EL1 */   \
> @@ -637,7 +685,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  
>   /* PMCR_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b000),
> -   trap_raz_wi },
> +   access_pmu_regs, reset_pmcr, PMCR_EL0, },
>   /* PMCNTENSET_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b001),
> trap_raz_wi },
> @@ -871,6 +919,34 @@ static const struct sys_reg_desc cp14_64_regs[] = {
>   { Op1( 0), CRm( 2), .access = trap_raz_wi },
>  };
>  
> +/* PMU CP15 registers accessor. */
> +static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
> +  const struct sys_reg_params *p,
> +  const struct sys_reg_desc *r)
> +{
> + unsigned long val;
> +
> + if (p->is_write) {
> + switch (r->reg) {
> + case c9_PMCR: {
> + /* Only update writeable bits of PMCR */
> + val = vcpu_cp15(vcpu, r->reg);
> + val &= ~ARMV8_PMCR_MASK;
> + val |= *vcpu_reg(vcpu, p->Rt) & ARMV8_PMCR_MASK;
> + vcpu_cp15(vcpu, r->reg) = val;
> + break;
> + }
> + default:
> + vcpu_cp15(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
> + break;
> + }
> + } else {
> + *vcpu_reg(vcpu, p->Rt) = vcpu_cp15(vcpu, r->reg);
> + }
> +
> + return true;
> +}
> +
>  /*
>   * Trapped cp15 registers. TTBR0/TTBR1 get a double encoding,
>   * depending on the way they are accessed (as a 32bit or a 64bit
> @@ -899,7 +975,8 @@ static const struct sys_reg_desc cp15_regs[] = {
>   { Op1( 0), CRn( 7), CRm(14), Op2( 2), access_dcsw },
>  
>   /* PMU */
> - { Op1( 0), CRn( 9), CRm(12), Op2( 0), trap_raz_wi },
> + { Op1( 0), CRn( 9), CRm(12), Op2( 0), access_pmu_cp15_regs,
> +   reset_pmcr, c9_PMCR },
>   { Op1( 0), CRn( 9), CRm(12), Op2(

[PATCH] KVM: PPC: Implement extension to report number of memslots

2015-10-15 Thread Nikunj A Dadhania

QEMU assumes 32 memslots if this extension is not implemented. Although,
current value of KVM_USER_MEM_SLOTS is 32, once KVM_USER_MEM_SLOTS
changes QEMU would take a wrong value.

Signed-off-by: Nikunj A Dadhania 
---
 arch/powerpc/kvm/powerpc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2e51289..6fd2405 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -559,6 +559,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
else
r = num_online_cpus();
break;
+   case KVM_CAP_NR_MEMSLOTS:
+   r = KVM_USER_MEM_SLOTS;
+   break;
case KVM_CAP_MAX_VCPUS:
r = KVM_MAX_VCPUS;
break;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] KVM: PPC: Implement extension to report number of memslots

2015-10-15 Thread Nikunj A Dadhania

QEMU assumes 32 memslots if this extension is not implemented. Although,
current value of KVM_USER_MEM_SLOTS is 32, once KVM_USER_MEM_SLOTS
changes QEMU would take a wrong value.

Signed-off-by: Nikunj A Dadhania 
---
 arch/powerpc/kvm/powerpc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2e51289..6fd2405 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -559,6 +559,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
else
r = num_online_cpus();
break;
+   case KVM_CAP_NR_MEMSLOTS:
+   r = KVM_USER_MEM_SLOTS;
+   break;
case KVM_CAP_MAX_VCPUS:
r = KVM_MAX_VCPUS;
break;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Alex Williamson

On Thu, 2015-10-15 at 16:46 +0200, Eric Auger wrote:
> Hi Arnd,
> On 10/15/2015 03:59 PM, Arnd Bergmann wrote:
> > On Thursday 15 October 2015 14:12:28 Christoffer Dall wrote:
> >>>
> >>> enum vfio_platform_op {
> >>>   VFIO_PLATFORM_BIND,
> >>>   VFIO_PLATFORM_UNBIND,
> >>>   VFIO_PLATFORM_RESET,
> >>> };
> >>>
> >>> struct platform_driver {
> >>> int (*probe)(struct platform_device *);
> >>> int (*remove)(struct platform_device *);
> >>>   ...
> >>>   int (*vfio_manage)(struct platform_device *, enum vfio_platform_op);
> >>> struct device_driver driver;
> >>> };
> >>>
> >>> This would integrate much more closely into the platform driver framework,
> >>> just like the regular vfio driver integrates into the PCI framework.
> >>> Unlike PCI however, you can't just use the generic driver framework to
> >>> unbind the driver, because you still need device specific code.
> >>>
> >> Thanks for these suggestions, really helpful.
> >>
> >> What I don't understand in the latter example is how VFIO knows which
> >> struct platform_driver to interact with?
> > 
> > This would assume that the driver remains bound to the device, so VFIO
> > gets a pointer to the device from somewhere (as it does today) and then
> > follows the dev->driver pointer to get to the platform_driver.

The complexity of managing a bi-modal driver seems like far more than a
little bit of code duplication in a device specific reset module and
extends into how userspace makes devices available through vfio, so I
think it's too late for that discussion.

> >> Also, just so I'm sure I understand correctly, VFIO_PLATFORM_UNBIND is
> >> then called by VFIO before the VFIO driver unbinds from the device
> >> (unbinding the platform driver from the device being a completely
> >> separate thing)?
> > 
> > This is where we'd need a little more changes for this approach. Instead
> > of unbinding the device from its driver, the idea would be that the
> > driver remains bound as far as the driver model is concerned, but
> > it would be in a quiescent state where no other subsystem interacts with
> > it (i.e. it gets unregistered from networking core or whichever it uses).
> 
> Currently we use the same mechanism as for PCI, ie. unbind the native
> driver and then bind VFIO platform driver in its place. Don't you think
> changing this may be a pain for user-space tools that are designed to
> work that way for PCI?
> 
> My personal preference would be to start with your first proposal since
> it looks (to me) less complex and "unknown" that the 2d approach.
> 
> Let's wait for Alex opinion too...

I thought the reason we took the approach we have now is so that we
don't have reset code loaded into the kernel unless we have a device
that needs it.  Therefore we don't really want to preemptively load all
the reset drivers and have them do a registration.  The unfortunate
side-effect of that is the platform code needs to go looking for the
driver.  We do that via the __symbol_get() trick, which only fails
without modules because the underscore variant isn't defined in that
case.  I remember asking Eric previously why we're using that rather
than symbol_get(), I've since forgotten his answer, but the fact that
__symbol_get() is only defined for modules makes it moot, we either need
to make symbol_get() work or define __symbol_get() for non-module
builds.

Otherwise, we should probably abandon the idea of these reset functions
being modules and build them into the vfio platform driver (which would
still be less loaded, dead code than a bi-modal host driver).  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Christoffer Dall

On Thu, Oct 15, 2015 at 5:49 PM, Arnd Bergmann  wrote:
> On Thursday 15 October 2015 17:03:21 Christoffer Dall wrote:
>> On Thu, Oct 15, 2015 at 04:55:13PM +0200, Arnd Bergmann wrote:
>> > On Thursday 15 October 2015 16:46:09 Eric Auger wrote:
>> > > >
>> > > > This is where we'd need a little more changes for this approach. 
>> > > > Instead
>> > > > of unbinding the device from its driver, the idea would be that the
>> > > > driver remains bound as far as the driver model is concerned, but
>> > > > it would be in a quiescent state where no other subsystem interacts 
>> > > > with
>> > > > it (i.e. it gets unregistered from networking core or whichever it 
>> > > > uses).
>> > >
>> > > Currently we use the same mechanism as for PCI, ie. unbind the native
>> > > driver and then bind VFIO platform driver in its place. Don't you think
>> > > changing this may be a pain for user-space tools that are designed to
>> > > work that way for PCI?
>> > >
>> > > My personal preference would be to start with your first proposal since
>> > > it looks (to me) less complex and "unknown" that the 2d approach.
>> >
>> > We certainly can't easily change from one approach to the other without
>> > breaking user expectations, so the decision needs to be made carefully.
>> >
>> > The main observation here is that platform devices are unlike PCI in this
>> > regard because they need extra per-device code. I have argued in the
>> > past that we should not reuse the "VFIO" name here because it's actually
>> > something else.
>>
>> I've adjusted to consider VFIO a general purpose framework for mapping
>> device resources into userspace/VMs, and there are certainly a lot of
>> commonality with both PCI, platform, and potentially other devices for
>> that to make sense.
>>
>>
>> > On the other hand, there are a lot of commonalities,
>> > we just have to make sure we don't try to force the code into one model
>> > that doesn't really work just to make it look more like PCI VFIO.
>> >
>>
>> But given that we now have code for platform device passthrough that
>> works in both QEMU and the kernel side and is actually useful for
>> people, is there a clear technical advantage to go back and rework thaat
>> at this point?
>>
>> Don't get me wrong, I like the idea of having a single driver bound to a
>> platform device, and then that's it, but it just feels like that
>> discussion doesn't necessarily belong in the context of a patch that
>> 'just' seeks to add reset functionality for a specific device for VFIO?
>
> Ah, this is for qemu/kvm? If there is already upstream qemu code that
> finds it easier to use this approach, it's certainly more logical to
> deal with it in my first approach than the second.

yes, indeed.

>
> I was thinking of ODP as the primary user, and that wouldn't need
> the interface to be consistent with PCI as much, because the code
> is inherently device specific there anyway.
>

I didn't think about how this applies to ODP (haven't had time to look
at ODP in any technical detail TBH), but it sounds like ODP could be
adapted to use either approach.

In any case we can always transition towards your 2nd approach in a
userspace compatible way later, if we find good reasons to.

Thanks,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Kinney, Michael D

Laszlo,

There is already a PCD for this timeout that is used by CpuMpPei.

gUefiCpuPkgTokenSpaceGuid.PcdCpuApInitTimeOutInMicroSeconds

I noticed that CpuDxe is using a hard coded AP timeout.  I think we should just 
use this same PCD for both the PEI and DXE CPU module and then set it for OVMF 
to the compatible value.

Mike

>-Original Message-
>From: edk2-devel [mailto:edk2-devel-boun...@lists.01.org] On Behalf Of
>Laszlo Ersek
>Sent: Thursday, October 15, 2015 9:19 AM
>To: Xiao Guangrong
>Cc: kvm@vger.kernel.org; Justen, Jordan L; edk2-de...@ml01.01.org; Alex
>Williamson; Chen Fan; Paolo Bonzini; Wanpeng Li
>Subject: Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is
>completely disabled
>
>CC'ing Jordan and Chen Fan.
>
>On 10/15/15 09:10, Xiao Guangrong wrote:
>>
>>
>> On 10/15/2015 02:58 PM, Janusz wrote:
>>> W dniu 15.10.2015 o 08:41, Xiao Guangrong pisze:


 On 10/15/2015 02:19 PM, Janusz wrote:
> W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:
>>
>>
>>
>> Well, the bug may be not in KVM. When this bug happened, i saw
>OVMF
>> only checked 1 CPU out, there is the log from OVMF's debug input:
>>
>> Flushing GCD
>> Flushing GCD
>> Flushing GCD
>> Flushing GCD
>> Flushing GCD
>> Flushing GCD
>> Flushing GCD
>> Flushing GCD
>> Flushing GCD
>> Flushing GCDs
>> Detect CPU count: 1
>>
>> So that the startup code has been freed however the APs are still
>> running,
>> i think that why we saw the vCPUs executed on unexpected address.
>>
>> After digging into OVMF's code, i noticed that BSP CPU waits for APs
>> for a fixed timer period, however, KVM recent changes require zap all
>> mappings if CR0.CD is changed, that means the APs need more time to
>> startup.
>>
>> After following changes to OVMF, the bug is completely gone on my
>> side:
>>
>> --- a/UefiCpuPkg/CpuDxe/ApStartup.c
>> +++ b/UefiCpuPkg/CpuDxe/ApStartup.c
>> @@ -454,7 +454,9 @@ StartApsStackless (
>>  //
>>  // Wait 100 milliseconds for APs to arrive at the ApEntryPoint
>> routine
>>  //
>> -  MicroSecondDelay (100 * 1000);
>> +  MicroSecondDelay (10 * 100 * 1000);
>>
>>  return EFI_SUCCESS;
>>}
>>
>> Janusz, could you please check this instead? You can switch to your
>> previous kernel to do this test.
>>
>>
> Ok, now first time when I started VM I was able to start system
> successfully. When I turned it off and started it again, it
> restarted my
> vm at system boot couple of times. Sometimes I also get very high cpu
> usage for no reason. Also, I get less fps in GTA 5 than in kernel
> 4.1, I
> get something like 30-55, but on 4.1 I get all the time 60 fps. This is
> my new log: https://bpaste.net/show/61a122ad7fe5
>

 Just confirm: the Qemu internal error did not appear any more, right?
>>> Yes, when I reverted your first patch, switched to -vga std from -vga
>>> none and didn't passthrough my GPU (case when I got this internal
>>> error), vm started without problem. I even didn't get any VM restarts
>>> like with passthrough
>>>
>>
>> Wow, it seems we have fixed the QEMU internal error now. :)
>>
>> Recurrently, Paolo has reverted some MTRR patches, was your test
>> based on these reverted patches?
>>
>> The GPU passthrough issue may be related to vfio (not sure), Alex, do
>> you have any idea?
>>
>> Laszlo, could you please check the root case is reasonable and fix it in
>> OVMF if it's right?
>
>The code that you have found is in edk2's EFI_MP_SERVICES_PROTOCOL
>implementation -- more closely, its initial CPU counter code --, from
>edk2 git commit 533263ee5a7f. It is not specific to OVMF -- it is
>generic edk2 code for Intel processors. (I'm CC'ing Jordan and Chen Fan
>because they authored the patch in question.)
>
>If VCPUs need more time to rendezvous than written in the code, on
>recent KVM, then I think we should introduce a new FixedPCD in
>UefiCpuPkg (practically: a compile time constant) for the timeout. Which
>is not hard to do.
>
>However, we'll need two things:
>- an idea about the concrete rendezvous timeout to set, from OvmfPkg
>
>- a *detailed* explanation / elaboration on your words:
>
>  "KVM recent changes require zap all mappings if CR0.CD is changed,
>  that means the APs need more time to startup"
>
>  Preferably with references to Linux kernel commits and the Intel SDM,
>  so that n00bs like me can get a fleeting idea. Do you mean that with
>  caching disabled, the APs execute their rendezvous code (from memory)
>  more slowly?
>
>> BTW, OVMF handles #UD with no trace - nothing is killed, and no call trace
>> in the debug input...
>
>There *is* a trace (of any unexpected exception -- at least for the
>BSP), but unfortunately its location is not intuitive.
>
>The exception handler that is

Re: [PATCH] KVM: x86: move steal time initialization to vcpu entry time

2015-10-15 Thread David Matlack

On Wed, Oct 14, 2015 at 3:33 PM, Marcelo Tosatti  wrote:
>
> As reported at https://bugs.launchpad.net/qemu/+bug/1494350,
> it is possible to have vcpu->arch.st.last_steal initialized
> from a thread other than vcpu thread, say the iothread, via
> KVM_SET_MSRS.
>
> Which can cause an overflow later (when subtracting from vcpu threads
> sched_info.run_delay).
>
> To avoid that, move steal time accumulation to vcpu entry time,
> before copying steal time data to guest.
>
> Signed-off-by: Marcelo Tosatti 

Reviewed-by: David Matlack 

>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8f0f6ec..0e0332e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2030,6 +2030,8 @@ static void accumulate_steal_time(struct kvm_vcpu *vcpu)
>
>  static void record_steal_time(struct kvm_vcpu *vcpu)
>  {
> +   accumulate_steal_time(vcpu);
> +
> if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
> return;
>
> @@ -2182,12 +2184,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> if (!(data & KVM_MSR_ENABLED))
> break;
>
> -   vcpu->arch.st.last_steal = current->sched_info.run_delay;
> -
> -   preempt_disable();
> -   accumulate_steal_time(vcpu);
> -   preempt_enable();
> -
> kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>
> break;
> @@ -2830,7 +2826,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> vcpu->cpu = cpu;
> }
>
> -   accumulate_steal_time(vcpu);
> kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>  }
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Christoffer Dall

On Thu, Oct 15, 2015 at 10:53:17AM -0600, Alex Williamson wrote:
> On Thu, 2015-10-15 at 16:46 +0200, Eric Auger wrote:
> > Hi Arnd,
> > On 10/15/2015 03:59 PM, Arnd Bergmann wrote:
> > > On Thursday 15 October 2015 14:12:28 Christoffer Dall wrote:
> > >>>
> > >>> enum vfio_platform_op {
> > >>>   VFIO_PLATFORM_BIND,
> > >>>   VFIO_PLATFORM_UNBIND,
> > >>>   VFIO_PLATFORM_RESET,
> > >>> };
> > >>>
> > >>> struct platform_driver {
> > >>> int (*probe)(struct platform_device *);
> > >>> int (*remove)(struct platform_device *);
> > >>>   ...
> > >>>   int (*vfio_manage)(struct platform_device *, enum 
> > >>> vfio_platform_op);
> > >>> struct device_driver driver;
> > >>> };
> > >>>
> > >>> This would integrate much more closely into the platform driver 
> > >>> framework,
> > >>> just like the regular vfio driver integrates into the PCI framework.
> > >>> Unlike PCI however, you can't just use the generic driver framework to
> > >>> unbind the driver, because you still need device specific code.
> > >>>
> > >> Thanks for these suggestions, really helpful.
> > >>
> > >> What I don't understand in the latter example is how VFIO knows which
> > >> struct platform_driver to interact with?
> > > 
> > > This would assume that the driver remains bound to the device, so VFIO
> > > gets a pointer to the device from somewhere (as it does today) and then
> > > follows the dev->driver pointer to get to the platform_driver.
> 
> The complexity of managing a bi-modal driver seems like far more than a
> little bit of code duplication in a device specific reset module and
> extends into how userspace makes devices available through vfio, so I
> think it's too late for that discussion.
>   

I have had extremely limited exposure to the implementation details of
the drivers for devices relevant for VFIO platform, so apologies for
asking stupid questions.

I'm sure that your point is valid, I just fully understand how the
complexities of a bi-modal driver arise?

Is it simply that the reset function in a particular device driver may
not be self-contained so therefore the whole driver would need to be
refactored to be able to do a reset for the purpose of VFIO?

> > >> Also, just so I'm sure I understand correctly, VFIO_PLATFORM_UNBIND is
> > >> then called by VFIO before the VFIO driver unbinds from the device
> > >> (unbinding the platform driver from the device being a completely
> > >> separate thing)?
> > > 
> > > This is where we'd need a little more changes for this approach. Instead
> > > of unbinding the device from its driver, the idea would be that the
> > > driver remains bound as far as the driver model is concerned, but
> > > it would be in a quiescent state where no other subsystem interacts with
> > > it (i.e. it gets unregistered from networking core or whichever it uses).
> > 
> > Currently we use the same mechanism as for PCI, ie. unbind the native
> > driver and then bind VFIO platform driver in its place. Don't you think
> > changing this may be a pain for user-space tools that are designed to
> > work that way for PCI?
> > 
> > My personal preference would be to start with your first proposal since
> > it looks (to me) less complex and "unknown" that the 2d approach.
> > 
> > Let's wait for Alex opinion too...
> 
> I thought the reason we took the approach we have now is so that we
> don't have reset code loaded into the kernel unless we have a device
> that needs it.  Therefore we don't really want to preemptively load all
> the reset drivers and have them do a registration.  The unfortunate
> side-effect of that is the platform code needs to go looking for the
> driver. 

Does the current approach have a separate driver for doing VFIO reset or
does it reuse the existing device driver?

Why does the driver registering itself instead of using symbol_get
imply that we must load reset drivers that we don't need?

Thanks,
-Christoffer


> We do that via the __symbol_get() trick, which only fails
> without modules because the underscore variant isn't defined in that
> case.  I remember asking Eric previously why we're using that rather
> than symbol_get(), I've since forgotten his answer, but the fact that
> __symbol_get() is only defined for modules makes it moot, we either need
> to make symbol_get() work or define __symbol_get() for non-module
> builds.
> 
> Otherwise, we should probably abandon the idea of these reset functions
> being modules and build them into the vfio platform driver (which would
> still be less loaded, dead code than a bi-modal host driver).  Thanks,
> 
> Alex
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] book3s_hv: Handle H_DOORBELL on the guest exit path

2015-10-15 Thread Gautham R. Shenoy

Currently a CPU running a guest can receive a H_DOORBELL in the
following two cases:
1) When the CPU is napping due to CEDE or there not being a guest
vcpu.
2) The CPU is running the guest vcpu.

Case 1), the doorbell message is not cleared since we were waking up
from nap. Hence when the EE bit gets set on transition from guest to
host, the H_DOORBELL interrupt is delivered to the host and the
corresponding handler is invoked.

However in Case 2), the message gets cleared by the action of taking
the H_DOORBELL interrupt. Since the CPU was running a guest, instead
of invoking the doorbell handler, the code invokes the second-level
interrupt handler to switch the context from the guest to the host. At
this point the setting of the EE bit doesn't result in the CPU getting
the doorbell interrupt since it has already been delivered once. So,
the handler for this doorbell is never invoked!

This causes softlockups if the missed DOORBELL was as IPI sent from a
sibling subcore CPU.

This patch fixes it by explitly invoking the doorbell handler on the
exit path if the exit reason is H_DOORBELL similar to the way an
EXTERNAL interrupt is handled. Since this will also handle Case 1), we
can unconditionally clear the doorbell message in
kvmppc_check_wake_reason.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b98889e..106c7f9 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -150,6 +150,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
cmpwi   cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK
cmpwi   r12, BOOK3S_INTERRUPT_EXTERNAL
beq 11f
+   cmpwi   r12, BOOK3S_INTERRUPT_H_DOORBELL
+   beq 15f /* Invoke the H_DOORBELL handler */
cmpwi   cr2, r12, BOOK3S_INTERRUPT_HMI
beq cr2, 14f/* HMI check */
 
@@ -174,6 +176,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
mtspr   SPRN_HSRR1, r7
b   hmi_exception_after_realmode
 
+15:mtspr SPRN_HSRR0, r8
+   mtspr SPRN_HSRR1, r7
+   ba0xe80
+
 kvmppc_primary_no_guest:
/* We handle this much like a ceded vcpu */
/* put the HDEC into the DEC, since HDEC interrupts don't wake us */
@@ -2436,14 +2442,19 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 
/* hypervisor doorbell */
 3: li  r12, BOOK3S_INTERRUPT_H_DOORBELL
+
+   /*
+* Clear the doorbell as we will invoke the handler
+* explicitly in the guest exit path.
+*/
+   lis r6, (PPC_DBELL_SERVER << (63-36))@h
+   PPC_MSGCLR(6)
/* see if it's a host IPI */
li  r3, 1
lbz r0, HSTATE_HOST_IPI(r13)
cmpwi   r0, 0
bnelr
-   /* if not, clear it and return -1 */
-   lis r6, (PPC_DBELL_SERVER << (63-36))@h
-   PPC_MSGCLR(6)
+   /* if not, return -1 */
li  r3, -1
blr
 
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] book3s_hv: Handle H_DOORBELL on the guest exit path

2015-10-15 Thread Gautham R. Shenoy

Currently a CPU running a guest can receive a H_DOORBELL in the
following two cases:
1) When the CPU is napping due to CEDE or there not being a guest
vcpu.
2) The CPU is running the guest vcpu.

Case 1), the doorbell message is not cleared since we were waking up
from nap. Hence when the EE bit gets set on transition from guest to
host, the H_DOORBELL interrupt is delivered to the host and the
corresponding handler is invoked.

However in Case 2), the message gets cleared by the action of taking
the H_DOORBELL interrupt. Since the CPU was running a guest, instead
of invoking the doorbell handler, the code invokes the second-level
interrupt handler to switch the context from the guest to the host. At
this point the setting of the EE bit doesn't result in the CPU getting
the doorbell interrupt since it has already been delivered once. So,
the handler for this doorbell is never invoked!

This causes softlockups if the missed DOORBELL was as IPI sent from a
sibling subcore CPU.

This patch fixes it by explitly invoking the doorbell handler on the
exit path if the exit reason is H_DOORBELL similar to the way an
EXTERNAL interrupt is handled. Since this will also handle Case 1), we
can unconditionally clear the doorbell message in
kvmppc_check_wake_reason.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b98889e..106c7f9 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -150,6 +150,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
cmpwi   cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK
cmpwi   r12, BOOK3S_INTERRUPT_EXTERNAL
beq 11f
+   cmpwi   r12, BOOK3S_INTERRUPT_H_DOORBELL
+   beq 15f /* Invoke the H_DOORBELL handler */
cmpwi   cr2, r12, BOOK3S_INTERRUPT_HMI
beq cr2, 14f/* HMI check */
 
@@ -174,6 +176,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
mtspr   SPRN_HSRR1, r7
b   hmi_exception_after_realmode
 
+15:mtspr SPRN_HSRR0, r8
+   mtspr SPRN_HSRR1, r7
+   ba0xe80
+
 kvmppc_primary_no_guest:
/* We handle this much like a ceded vcpu */
/* put the HDEC into the DEC, since HDEC interrupts don't wake us */
@@ -2436,14 +2442,19 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 
/* hypervisor doorbell */
 3: li  r12, BOOK3S_INTERRUPT_H_DOORBELL
+
+   /*
+* Clear the doorbell as we will invoke the handler
+* explicitly in the guest exit path.
+*/
+   lis r6, (PPC_DBELL_SERVER << (63-36))@h
+   PPC_MSGCLR(6)
/* see if it's a host IPI */
li  r3, 1
lbz r0, HSTATE_HOST_IPI(r13)
cmpwi   r0, 0
bnelr
-   /* if not, clear it and return -1 */
-   lis r6, (PPC_DBELL_SERVER << (63-36))@h
-   PPC_MSGCLR(6)
+   /* if not, return -1 */
li  r3, -1
blr
 
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-10-15 Thread David Matlack

On Wed, Oct 14, 2015 at 6:33 PM, Wu, Feng  wrote:
>
>> -Original Message-
>> From: David Matlack [mailto:dmatl...@google.com]
>> Sent: Thursday, October 15, 2015 7:41 AM
>> To: Wu, Feng 
>> Cc: Paolo Bonzini ; alex.william...@redhat.com; Joerg
>> Roedel ; Marcelo Tosatti ;
>> eric.au...@linaro.org; kvm list ; iommu@lists.linux-
>> foundation.org; linux-ker...@vger.kernel.org
>> Subject: Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when
>> vCPU is blocked
>>
>> Hi Feng.
>>
>> On Fri, Sep 18, 2015 at 7:29 AM, Feng Wu  wrote:
>> > This patch updates the Posted-Interrupts Descriptor when vCPU
>> > is blocked.
>> >
>> > pre-block:
>> > - Add the vCPU to the blocked per-CPU list
>> > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>> >
>> > post-block:
>> > - Remove the vCPU from the per-CPU list
>>
>> I'm wondering what happens if a posted interrupt arrives at the IOMMU
>> after pre-block and before post-block.
>>
>> In pre_block, NV is set to POSTED_INTR_WAKEUP_VECTOR. IIUC, this means
>> future posted interrupts will not trigger "Posted-Interrupt Processing"
>> (PIR will not get copied to VIRR). Instead, the IOMMU will do ON := 1,
>> PIR |= (1 << vector), and send POSTED_INTR_WAKEUP_VECTOR. PIWV calls
>> wakeup_handler which does kvm_vcpu_kick. kvm_vcpu_kick does a wait-queue
>> wakeup and possibly a scheduler ipi.
>>
>> But the VCPU is sitting in kvm_vcpu_block. It spins and/or schedules
>> (wait queue) until it has a reason to wake up. I couldn't find a code
>> path from kvm_vcpu_block that lead to checking ON or PIR. How does the
>> blocked VCPU "receive" the posted interrupt? (And when does Posted-
>> Interrupt Processing get triggered?)
>
> In the pre_block, it also change the 'NDST' filed to the pCPU, on which the 
> vCPU
> is put to the per-CPU list 'blocked_vcpu_on_cpu', so when posted-interrupts
> come it, it will sent the wakeup notification event to the pCPU above, then in
> the wakeup_handler, it can find the vCPU from the per-CPU list, hence
> kvm_vcpu_kick can wake up it.

Thank you for your response. I was actually confused about something
else. After wakeup_handler->kvm_vcpu_kick causes the vcpu to wake up,
that vcpu calls kvm_vcpu_check_block() to check if there are pending
events, otherwise the vcpu goes back to sleep. I had trouble yesterday
finding the code path from kvm_vcpu_check_block() which checks PIR/ON.

But after spending more time reading the source code this morning I
found that kvm_vcpu_check_block() eventually calls into
vmx_sync_pir_to_irr(), which copies PIR to IRR and clears ON. And then
apic_find_highest_irr() detects the pending posted interrupt.

>
> Thanks,
> Feng
>
>>
>> Thanks!
>>
>> >
>> > Signed-off-by: Feng Wu 
>> > ---
>> > v9:
>> > - Add description for blocked_vcpu_on_cpu_lock in
>> Documentation/virtual/kvm/locking.txt
>> > - Check !kvm_arch_has_assigned_device(vcpu->kvm) first, then
>> >   !irq_remapping_cap(IRQ_POSTING_CAP)
>> >
>> > v8:
>> > - Rename 'pi_pre_block' to 'pre_block'
>> > - Rename 'pi_post_block' to 'post_block'
>> > - Change some comments
>> > - Only add the vCPU to the blocking list when the VM has assigned devices.
>> >
>> >  Documentation/virtual/kvm/locking.txt |  12 +++
>> >  arch/x86/include/asm/kvm_host.h   |  13 +++
>> >  arch/x86/kvm/vmx.c| 153
>> ++
>> >  arch/x86/kvm/x86.c|  53 +---
>> >  include/linux/kvm_host.h  |   3 +
>> >  virt/kvm/kvm_main.c   |   3 +
>> >  6 files changed, 227 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/Documentation/virtual/kvm/locking.txt
>> b/Documentation/virtual/kvm/locking.txt
>> > index d68af4d..19f94a6 100644
>> > --- a/Documentation/virtual/kvm/locking.txt
>> > +++ b/Documentation/virtual/kvm/locking.txt
>> > @@ -166,3 +166,15 @@ Comment:   The srcu read lock must be held while
>> accessing memslots (e.g.
>> > MMIO/PIO address->device structure mapping (kvm->buses).
>> > The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
>> > if it is needed by multiple functions.
>> > +
>> > +Name:  blocked_vcpu_on_cpu_lock
>> > +Type:  spinlock_t
>> > +Arch:  x86
>> > +Protects:  blocked_vcpu_on_cpu
>> > +Comment:   This is a per-CPU lock and it is used for VT-d 
>> > posted-interrupts.
>> > +   When VT-d posted-interrupts is supported and the VM has 
>> > assigned
>> > +   devices, we put the blocked vCPU on the list 
>> > blocked_vcpu_on_cpu
>> > +   protected by blocked_vcpu_on_cpu_lock, when VT-d hardware
>> issues
>> > +   wakeup notification event since external interrupts from 
>> > the
>> > +   assigned devices happens, we will find the vCPU on the 
>> > list to
>> > +

Re: [PATCH v9 17/18] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-10-15 Thread Paolo Bonzini



On 15/10/2015 19:39, David Matlack wrote:
> But after spending more time reading the source code this morning I
> found that kvm_vcpu_check_block() eventually calls into
> vmx_sync_pir_to_irr(), which copies PIR to IRR and clears ON. And then
> apic_find_highest_irr() detects the pending posted interrupt.

Right.  And related to this, Feng, can you check if this is still
necessary on kvm/queue:

@@ -6518,6 +6523,20 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_vcpu_reload_apic_access_page(vcpu);
}
 
+   /*
+* KVM_REQ_EVENT is not set when posted interrupts are set by
+* VT-d hardware, so we have to update RVI unconditionally.
+*/
+   if (kvm_lapic_enabled(vcpu)) {
+   /*
+* Update architecture specific hints for APIC
+* virtual interrupt delivery.
+*/
+   if (kvm_x86_ops->hwapic_irr_update)
+   kvm_x86_ops->hwapic_irr_update(vcpu,
+   kvm_lapic_find_highest_irr(vcpu));
+   }
+
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
kvm_apic_accept_events(vcpu);
if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
@@ -6534,13 +6553,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_x86_ops->enable_irq_window(vcpu);
 
if (kvm_lapic_enabled(vcpu)) {
-   /*
-* Update architecture specific hints for APIC
-* virtual interrupt delivery.
-*/
-   if (kvm_x86_ops->hwapic_irr_update)
-   kvm_x86_ops->hwapic_irr_update(vcpu,
-   kvm_lapic_find_highest_irr(vcpu));
update_cr8_intercept(vcpu);
kvm_lapic_sync_to_vapic(vcpu);
}


It may be obsolete now that we have the patch from Radim to set KVM_REQ_EVENT
in vmx_sync_pir_to_irr (http://permalink.gmane.org/gmane.linux.kernel/2057138).

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-15 Thread Laszlo Ersek

On 10/15/15 18:53, Kinney, Michael D wrote:
> Laszlo,
> 
> There is already a PCD for this timeout that is used by CpuMpPei.
> 
>   gUefiCpuPkgTokenSpaceGuid.PcdCpuApInitTimeOutInMicroSeconds
> 
> I noticed that CpuDxe is using a hard coded AP timeout.  I think we should 
> just use this same PCD for both the PEI and DXE CPU module and then set it 
> for OVMF to the compatible value.

Perfect, thank you!

(I notice the default in the DEC file is 5, which is half of what
the DXE driver hardcodes.)

Now we only need a recommended (or experimental) value for it, and an
explanation why 100*1000 is no longer sufficient on KVM :)

Thanks!
Laszlo


> 
> Mike
> 
>> -Original Message-
>> From: edk2-devel [mailto:edk2-devel-boun...@lists.01.org] On Behalf Of
>> Laszlo Ersek
>> Sent: Thursday, October 15, 2015 9:19 AM
>> To: Xiao Guangrong
>> Cc: kvm@vger.kernel.org; Justen, Jordan L; edk2-de...@ml01.01.org; Alex
>> Williamson; Chen Fan; Paolo Bonzini; Wanpeng Li
>> Subject: Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is
>> completely disabled
>>
>> CC'ing Jordan and Chen Fan.
>>
>> On 10/15/15 09:10, Xiao Guangrong wrote:
>>>
>>>
>>> On 10/15/2015 02:58 PM, Janusz wrote:
 W dniu 15.10.2015 o 08:41, Xiao Guangrong pisze:
>
>
> On 10/15/2015 02:19 PM, Janusz wrote:
>> W dniu 15.10.2015 o 06:19, Xiao Guangrong pisze:
>>>
>>>
>>>
>>> Well, the bug may be not in KVM. When this bug happened, i saw
>> OVMF
>>> only checked 1 CPU out, there is the log from OVMF's debug input:
>>>
>>> Flushing GCD
>>> Flushing GCD
>>> Flushing GCD
>>> Flushing GCD
>>> Flushing GCD
>>> Flushing GCD
>>> Flushing GCD
>>> Flushing GCD
>>> Flushing GCD
>>> Flushing GCDs
>>> Detect CPU count: 1
>>>
>>> So that the startup code has been freed however the APs are still
>>> running,
>>> i think that why we saw the vCPUs executed on unexpected address.
>>>
>>> After digging into OVMF's code, i noticed that BSP CPU waits for APs
>>> for a fixed timer period, however, KVM recent changes require zap all
>>> mappings if CR0.CD is changed, that means the APs need more time to
>>> startup.
>>>
>>> After following changes to OVMF, the bug is completely gone on my
>>> side:
>>>
>>> --- a/UefiCpuPkg/CpuDxe/ApStartup.c
>>> +++ b/UefiCpuPkg/CpuDxe/ApStartup.c
>>> @@ -454,7 +454,9 @@ StartApsStackless (
>>>  //
>>>  // Wait 100 milliseconds for APs to arrive at the ApEntryPoint
>>> routine
>>>  //
>>> -  MicroSecondDelay (100 * 1000);
>>> +  MicroSecondDelay (10 * 100 * 1000);
>>>
>>>  return EFI_SUCCESS;
>>>}
>>>
>>> Janusz, could you please check this instead? You can switch to your
>>> previous kernel to do this test.
>>>
>>>
>> Ok, now first time when I started VM I was able to start system
>> successfully. When I turned it off and started it again, it
>> restarted my
>> vm at system boot couple of times. Sometimes I also get very high cpu
>> usage for no reason. Also, I get less fps in GTA 5 than in kernel
>> 4.1, I
>> get something like 30-55, but on 4.1 I get all the time 60 fps. This is
>> my new log: https://bpaste.net/show/61a122ad7fe5
>>
>
> Just confirm: the Qemu internal error did not appear any more, right?
 Yes, when I reverted your first patch, switched to -vga std from -vga
 none and didn't passthrough my GPU (case when I got this internal
 error), vm started without problem. I even didn't get any VM restarts
 like with passthrough

>>>
>>> Wow, it seems we have fixed the QEMU internal error now. :)
>>>
>>> Recurrently, Paolo has reverted some MTRR patches, was your test
>>> based on these reverted patches?
>>>
>>> The GPU passthrough issue may be related to vfio (not sure), Alex, do
>>> you have any idea?
>>>
>>> Laszlo, could you please check the root case is reasonable and fix it in
>>> OVMF if it's right?
>>
>> The code that you have found is in edk2's EFI_MP_SERVICES_PROTOCOL
>> implementation -- more closely, its initial CPU counter code --, from
>> edk2 git commit 533263ee5a7f. It is not specific to OVMF -- it is
>> generic edk2 code for Intel processors. (I'm CC'ing Jordan and Chen Fan
>> because they authored the patch in question.)
>>
>> If VCPUs need more time to rendezvous than written in the code, on
>> recent KVM, then I think we should introduce a new FixedPCD in
>> UefiCpuPkg (practically: a compile time constant) for the timeout. Which
>> is not hard to do.
>>
>> However, we'll need two things:
>> - an idea about the concrete rendezvous timeout to set, from OvmfPkg
>>
>> - a *detailed* explanation / elaboration on your words:
>>
>>  "KVM recent changes require zap all mappings if CR0.CD is changed,
>>  that means the APs need more time to startup"
>>
>>  Preferably with

54 matches

Mail list logo