Re: [PATCH v2 0/5] vfio, dax: prevent long term filesystem-dax pins and other fixes

2018-02-23 Thread Haozhong Zhang
On 02/22/18 23:17 -0800, Dan Williams wrote:
> Changes since v1 [1]:
> 
> * Fix the detection of device-dax file instances in vma_is_fsdax().
>   (Haozhong, Gerd)
> 
> * Fix compile breakage in the FS_DAX=n and DEV_DAX=y case. (0day robot)
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-February/014046.html
> 
> ---
> 
> The vfio interface, like RDMA, wants to setup long term (indefinite)
> pins of the pages backing an address range so that a guest or userspace
> driver can perform DMA with the physical address. Given that this
> pinning may lead to filesystem operations deadlocking in the
> filesystem-dax case, the pinning request needs to be rejected.
> 
> The longer term fix for vfio, RDMA, and any other long term pin user, is
> to provide a 'pin with lease' mechanism. Similar to the leases that are
> held for pNFS RDMA layouts, this userspace lease gives the kernel a way
> to notify userspace that the block layout of the file is changing and
> the kernel is revoking access to pinned pages.
> 
> ---
> 
> Dan Williams (5):
>   dax: fix vma_is_fsdax() helper
>   dax: fix dax_mapping() definition in the FS_DAX=n + DEV_DAX=y case
>   dax: fix S_DAX definition
>   dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case
>   vfio: disable filesystem-dax page pinning
> 
> 
>  drivers/vfio/vfio_iommu_type1.c |   18 +++---
>  include/linux/dax.h |9 ++---
>  include/linux/fs.h  |6 --
>  3 files changed, 25 insertions(+), 8 deletions(-)

Tested on QEMU with fs-dax and device-dax as vNVDIMM backends
respectively with vfio passthrough. The fs-dax case fails QEMU as
expected, and the device-dax case works normally now.
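
For reference, the distinction the fixed vma_is_fsdax() helper has to
draw is between fs-dax (a DAX mapping of a regular file, whose block map
the filesystem may change under a pin) and device-dax (a character
device with a static layout, which is safe to pin). A minimal sketch of
such a check, not necessarily the exact merged code:

    static inline bool vma_is_fsdax(struct vm_area_struct *vma)
    {
            struct inode *inode;

            if (!vma->vm_file)
                    return false;           /* anonymous mapping */
            inode = file_inode(vma->vm_file);
            if (S_ISCHR(inode->i_mode))
                    return false;           /* device-dax: pin allowed */
            return IS_DAX(inode);           /* fs-dax: long term pin rejected */
    }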

Tested-by: Haozhong Zhang 



Re: VMs freezing when host is running 4.14

2018-02-13 Thread Haozhong Zhang
On 02/13/18 20:04 -0600, Josh Poimboeuf wrote:
> On Sun, Feb 11, 2018 at 02:39:41PM +0100, Marc Haber wrote:
> > Hi,
> > 
> > after in total nine weeks of bisecting, broken filesystems, service
> > outages (thankfully on unimportant systems), 4.15 seems to have fixed the
> > issue. After going to 4.15, the crashes never happened again.
> > 
> > They have, however, happened with each and every 4.14 release I tried,
> > which I stopped doing with 4.14.15 on Jan 28.
> > 
> > This means, for me, that the issue is fixed and that I have just wasted
> > nine weeks of time.
> > 
> > For you, this means that you have a crippling, data-eating issue in the
> > current long-term release kernel. I do sincerely hope that I never have
> > to lay my eye on any 4.14 kernel and hope that no major distribution
> > will release with this version.
> 
> I saw something similar today, also in kvm_async_pf_task_wait().  I had
> -tip in the guest (based on 4.16.0-rc1) and Fedora
> 4.14.16-300.fc27.x86_64 on the host.

Could you check whether this patch [1] on the host kernel fixes your
issue?  Paolo had sent it to the stable tree and it may take a while
to be merged in the next 4.14 stable release (perhaps in this week).

[1] https://patchwork.kernel.org/patch/10155177/

Haozhong

> 
> In my case the double fault seemed to be caused by a corrupt CR3.  The
> #DF hit when trying to call schedule() with the bad CR3, immediately
> after enabling interrupts.  So there could be something in the interrupt
> path which is corrupting CR3, but I audited that path in the guest
> kernel (which had PTI enabled) and it looks straightforward.
> 
> The reason I think CR3 was corrupt is because some earlier stack traces
> (which I forced as part of my unwinder testing) showed a kernel CR3
> value of 0x130ada005, but the #DF output showed it as 0x2212006.  And
> that also explains why a call instruction would result in a #DF.  But I
> have no clue *how* it got corrupted.
> 
> I don't know if the bug is in the host (4.14) or the guest (4.16).
> 
> [ 4031.436692] PANIC: double fault, error_code: 0x0
> [ 4031.439937] CPU: 1 PID: 1227 Comm: kworker/1:1 Not tainted 4.16.0-rc1+ #12
> [ 4031.440632] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> [ 4031.441475] Workqueue: events netstamp_clear
> [ 4031.441897] RIP: 0010:kvm_async_pf_task_wait+0x19d/0x250
> [ 4031.442411] RSP: :c9f5fbc0 EFLAGS: 0202
> [ 4031.442916] RAX: 880136048000 RBX: c9f5fbe0 RCX: 0006
> [ 4031.443601] RDX: 0006 RSI: 880136048a40 RDI: 880136048000
> [ 4031.444285] RBP: c9f5fc90 R08: 05212156f5cf R09: 
> [ 4031.444966] R10:  R11:  R12: c9f5fbf0
> [ 4031.445650] R13: 2e88 R14: 82ad6360 R15: 
> [ 4031.446335] FS:  () GS:88013a80() knlGS:
> [ 4031.447104] CS:  0010 DS:  ES:  CR0: 80050033
> [ 4031.447659] CR2: 6001 CR3: 02212006 CR4: 001606e0
> [ 4031.448354] Call Trace:
> [ 4031.448602]  ? kvm_clock_read+0x1f/0x30
> [ 4031.448985]  ? prepare_to_swait+0x1d/0x70
> [ 4031.449384]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 4031.449845]  ? do_async_page_fault+0x67/0x90
> [ 4031.450283]  do_async_page_fault+0x67/0x90
> [ 4031.450684]  async_page_fault+0x25/0x50
> [ 4031.451067] RIP: 0010:text_poke+0x60/0x250
> [ 4031.451528] RSP: :c9f5fd78 EFLAGS: 00010286
> [ 4031.452082] RAX: ea00 RBX: ea05f200 RCX: 817c88e9
> [ 4031.452843] RDX: 0001 RSI: c9f5fdbf RDI: 817c88e4
> [ 4031.453568] RBP: 817c88e4 R08:  R09: 0001
> [ 4031.454283] R10: c9f5fdf0 R11: 104eab8665f42bc7 R12: 0001
> [ 4031.455010] R13: c9f5fdbf R14: 817c98e4 R15: 
> [ 4031.455719]  ? dev_gro_receive+0x3f4/0x6f0
> [ 4031.456123]  ? netif_receive_skb_internal+0x24/0x380
> [ 4031.456641]  ? netif_receive_skb_internal+0x29/0x380
> [ 4031.457202]  ? netif_receive_skb_internal+0x24/0x380
> [ 4031.457743]  ? text_poke+0x28/0x250
> [ 4031.458084]  ? netif_receive_skb_internal+0x24/0x380
> [ 4031.458567]  ? netif_receive_skb_internal+0x25/0x380
> [ 4031.459046]  text_poke_bp+0x55/0xe0
> [ 4031.459393]  arch_jump_label_transform+0x90/0xf0
> [ 4031.459842]  __jump_label_update+0x63/0x70
> [ 4031.460243]  static_key_enable_cpuslocked+0x54/0x80
> [ 4031.460713]  static_key_enable+0x16/0x20
> [ 4031.461096]  process_one_work+0x266/0x6d0
> [ 4031.461506]  worker_thread+0x3a/0x390
> [ 4031.462328]  ? process_one_work+0x6d0/0x6d0
> [ 4031.463299]  kthread+0x121/0x140
> [ 4031.464122]  ? kthread_create_worker_on_cpu+0x70/0x70
> [ 4031.465070]  ret_from_fork+0x3a/0x50
> [ 4031.465859] Code: 89 58 08 4c 89 f7 49 89 9d 20 35 ad 82 48 89 95 58 ff ff ff 4c 8d 63 10 e8 61 e4 8d 0

Re: [PATCH 3/3] vfio: disable filesystem-dax page pinning

2018-02-05 Thread Haozhong Zhang
Hi Dan,

On 02/04/18 15:05 -0800, Dan Williams wrote:
> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
> page cache indirection a DAX mapping maps filesystem blocks directly.
> This means that the filesystem must not modify a file's block map while
> any page in a mapping is pinned. In order to prevent userspace from
> holding up filesystem operations indefinitely, disallow
> 'longterm' Filesystem-DAX mappings.
> 
> RDMA has the same conflict and the plan there is to add a 'with lease'
> mechanism to allow the kernel to notify userspace that the mapping is
> being torn down for block-map maintenance. Perhaps something similar can
> be put in place for vfio.
> 
> Note that xfs and ext4 still report:
> 
>"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
> 
> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
> of the last hurdles to remove that designation.
> 
> Cc: Alex Williamson 
> Cc: Michal Hocko 
> Cc: Christoph Hellwig 
> Cc: k...@vger.kernel.org
> Cc: 
> Reported-by: Haozhong Zhang 
> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
> Signed-off-by: Dan Williams 
> ---
>  drivers/vfio/vfio_iommu_type1.c |   18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e30e29ae4819..45657e2b1ff7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  {
>   struct page *page[1];
>   struct vm_area_struct *vma;
> + struct vm_area_struct *vmas[1];
>   int ret;
>  
>   if (mm == current->mm) {
> - ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
> -   page);
> + ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
> +   page, vmas);
>   } else {
>   unsigned int flags = 0;
>  
> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  
>   down_read(&mm->mmap_sem);
>   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> - NULL, NULL);
> + vmas, NULL);
> + /*
> +  * The lifetime of a vaddr_get_pfn() page pin is
> +  * userspace-controlled. In the fs-dax case this could
> +  * lead to indefinite stalls in filesystem operations.
> +  * Disallow attempts to pin fs-dax pages via this
> +  * interface.
> +  */
> + if (ret > 0 && vma_is_fsdax(vmas[0])) {
> + ret = -EOPNOTSUPP;
> + put_page(page[0]);
> + }
>   up_read(&mm->mmap_sem);
>   }
>  
> 

Besides this patch series, are there other patches needed to make
vma_is_fsdax() work with device-dax?

I applied this patch series on the libnvdimm-for-next branch of the
nvdimm tree (ee95f4059a83), and found it also fails device-dax mappings
with vfio. It can be reproduced by the following steps:

1. Attach PCI device at BDF :03:10.2 to vfio-pci.
   # modprobe vfio-pci
   # lspci -n -s :03:10.2
   03:10.2 0200: 8086:1515 (rev 01)
   # echo :03:10.2 > /sys/bus/pci/devices/:03:10.2/driver/unbind
   # echo 8086:1515 > /sys/bus/pci/drivers/vfio-pci/new_id

2. Use RAM to emulate NVDIMM and create a device-dax device /dev/dax0.0
   # cat /proc/iomem
   ...
   1-2 : Persistent Memory (legacy)
 1-2 : namespace0.0
   ...

   # ndctl create-namespace -f -e namespace0.0 -m dax
   {
 "dev":"namespace0.0",
 "mode":"dax",
 "size":8453619712,
 "uuid":"e1db00bc-f830-4f1b-ac18-091ae7df4f93",
 "daxdevs":[
   {
 "chardev":"dax0.0",
 "size":8453619712
   }
 ]
   }

3. Create a VM with assigned PCI device in step 1 and the device-dax
   device in step 2.
   # qemu-system-x86_64 -machine pc,accel=kvm,nvdimm=on -cpu host \
-m 4G,slots=32,maxmem=128G \
-drive file=VM_DISK_IMG.img,format=raw,if=virtio \
-object memory-backend-file,id=nv_be1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M \
-device nvdimm,id=nv1,memdev=nv_be1 \
-device ioh3420,id=root.

Re: [PATCH 3/3] vfio: disable filesystem-dax page pinning

2018-02-04 Thread Haozhong Zhang
On 02/04/18 15:05 -0800, Dan Williams wrote:
> Filesystem-DAX is incompatible with 'longterm' page pinning. Without
> page cache indirection a DAX mapping maps filesystem blocks directly.
> This means that the filesystem must not modify a file's block map while
> any page in a mapping is pinned. In order to prevent userspace from
> holding up filesystem operations indefinitely, disallow
> 'longterm' Filesystem-DAX mappings.
> 
> RDMA has the same conflict and the plan there is to add a 'with lease'
> mechanism to allow the kernel to notify userspace that the mapping is
> being torn down for block-map maintenance. Perhaps something similar can
> be put in place for vfio.
> 
> Note that xfs and ext4 still report:
> 
>"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
> 
> ...at mount time, and resolving the dax-dma-vs-truncate problem is one
> of the last hurdles to remove that designation.
> 
> Cc: Alex Williamson 
> Cc: Michal Hocko 
> Cc: Christoph Hellwig 
> Cc: k...@vger.kernel.org
> Cc: 
> Reported-by: Haozhong Zhang 
> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
> Signed-off-by: Dan Williams 
> ---
>  drivers/vfio/vfio_iommu_type1.c |   18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e30e29ae4819..45657e2b1ff7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  {
>   struct page *page[1];
>   struct vm_area_struct *vma;
> + struct vm_area_struct *vmas[1];
>   int ret;
>  
>   if (mm == current->mm) {
> - ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
> -   page);
> + ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
> +   page, vmas);

vmas is not used subsequently if this branch is taken, so can we use
NULL here?
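
For illustration, the branch would then look like the following sketch;
as far as I recall, get_user_pages_longterm() allocates a temporary vmas
array internally when it is passed NULL, so the fs-dax check still
happens:

    if (mm == current->mm) {
            /* sketch: the vmas result is unused on this path */
            ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
                                          page, NULL);
    }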

Thanks,
Haozhong

>   } else {
>   unsigned int flags = 0;
>  
> @@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  
>   down_read(&mm->mmap_sem);
>   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> - NULL, NULL);
> + vmas, NULL);
> + /*
> +  * The lifetime of a vaddr_get_pfn() page pin is
> +  * userspace-controlled. In the fs-dax case this could
> +  * lead to indefinite stalls in filesystem operations.
> +  * Disallow attempts to pin fs-dax pages via this
> +  * interface.
> +  */
> + if (ret > 0 && vma_is_fsdax(vmas[0])) {
> + ret = -EOPNOTSUPP;
> + put_page(page[0]);
> + }
>   up_read(&mm->mmap_sem);
>   }
>  
> 


Re: [PATCH v3 2/4] KVM: X86: Fix loss of exception which has not yet injected

2018-01-09 Thread Haozhong Zhang
On 01/09/18 00:57 -0800, Liran Alon wrote:
> 
> - haozhong.zh...@intel.com wrote:
> 
> > On 01/07/18 00:26 -0700, Ross Zwisler wrote:
> > > On Wed, Aug 23, 2017 at 10:21 PM, Wanpeng Li  wrote:
> > > > From: Wanpeng Li 
> > > >
> > > > vmx_complete_interrupts() assumes that the exception is always injected,
> > > > so it would be dropped by kvm_clear_exception_queue(). This patch separates
> > > > exception.pending from exception.injected: exception.injected represents that
> > > > the exception is injected or should be reinjected because a vmexit occurred
> > > > during event delivery in VMX non-root operation; exception.pending represents
> > > > that the exception is queued and will be cleared when injecting the exception
> > > > to the guest. So exception.pending and exception.injected can cooperate to
> > > > guarantee the exception will not be lost.
> > > >
> > > > Reported-by: Radim Krčmář 
> > > > Cc: Paolo Bonzini 
> > > > Cc: Radim Krčmář 
> > > > Signed-off-by: Wanpeng Li 
> > > > ---
> > > 
> > > I'm seeing a regression in my QEMU based NVDIMM testing system, and I
> > > bisected it to this commit.
> > > 
> > > The behavior I'm seeing is that heavy I/O to simulated NVDIMMs in
> > > multiple virtual machines causes the QEMU guests to receive double
> > > faults, crashing them.  Here's an example backtrace:
> > > 
> > > [ 1042.653816] PANIC: double fault, error_code: 0x0
> > > [ 1042.654398] CPU: 2 PID: 30257 Comm: fsstress Not tainted 4.15.0-rc5 #1
> > > [ 1042.655169] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> > > [ 1042.656121] RIP: 0010:memcpy_flushcache+0x4d/0x180
> > > [ 1042.656631] RSP: 0018:ac098c7d3808 EFLAGS: 00010286
> > > [ 1042.657245] RAX: ac0d18ca8000 RBX: 0fe0 RCX: ac0d18ca8000
> > > [ 1042.658085] RDX: 921aaa5df000 RSI: 921aaa5e RDI: 19f26e6c9000
> > > [ 1042.658802] RBP: 1000 R08:  R09: 
> > > [ 1042.659503] R10:  R11:  R12: 921aaa5df020
> > > [ 1042.660306] R13: ac0d18ca8000 R14: f4c102a977c0 R15: 1000
> > > [ 1042.661132] FS:  7f71530b90c0() GS:921b3b28() knlGS:
> > > [ 1042.662051] CS:  0010 DS:  ES:  CR0: 80050033
> > > [ 1042.662528] CR2: 01156002 CR3: 00012a936000 CR4: 06e0
> > > [ 1042.663093] Call Trace:
> > > [ 1042.663329]  write_pmem+0x6c/0xa0 [nd_pmem]
> > > [ 1042.663668]  pmem_do_bvec+0x15f/0x330 [nd_pmem]
> > > [ 1042.664056]  ? kmem_alloc+0x61/0xe0 [xfs]
> > > [ 1042.664393]  pmem_make_request+0xdd/0x220 [nd_pmem]
> > > [ 1042.664781]  generic_make_request+0x11f/0x300
> > > [ 1042.665135]  ? submit_bio+0x6c/0x140
> > > [ 1042.665436]  submit_bio+0x6c/0x140
> > > [ 1042.665754]  ? next_bio+0x18/0x40
> > > [ 1042.666025]  ? _cond_resched+0x15/0x40
> > > [ 1042.666341]  submit_bio_wait+0x53/0x80
> > > [ 1042.666804]  blkdev_issue_zeroout+0xdc/0x210
> > > [ 1042.667336]  ? __dax_zero_page_range+0xb5/0x140
> > > [ 1042.667810]  __dax_zero_page_range+0xb5/0x140
> > > [ 1042.668197]  ? xfs_file_iomap_begin+0x2bd/0x8e0 [xfs]
> > > [ 1042.668611]  iomap_zero_range_actor+0x7c/0x1b0
> > > [ 1042.668974]  ? iomap_write_actor+0x170/0x170
> > > [ 1042.669318]  iomap_apply+0xa4/0x110
> > > [ 1042.669616]  ? iomap_write_actor+0x170/0x170
> > > [ 1042.669958]  iomap_zero_range+0x52/0x80
> > > [ 1042.670255]  ? iomap_write_actor+0x170/0x170
> > > [ 1042.670616]  xfs_setattr_size+0xd4/0x330 [xfs]
> > > [ 1042.670995]  xfs_ioc_space+0x27e/0x2f0 [xfs]
> > > [ 1042.671332]  ? terminate_walk+0x87/0xf0
> > > [ 1042.671662]  xfs_file_ioctl+0x862/0xa40 [xfs]
> > > [ 1042.672035]  ? _copy_to_user+0x22/0x30
> > > [ 1042.672346]  ? cp_new_stat+0x150/0x180
> > > [ 1042.672663]  do_vfs_ioctl+0xa1/0x610
> > > [ 1042.672960]  ? SYSC_newfstat+0x3c/0x60
> > > [ 1042.673264]  SyS_ioctl+0x74/0x80
> > > [ 1042.673661]  entry_SYSCALL_64_fastpath+0x1a/0x7d
> > > [ 1042.674239] RIP: 0033:0x7f71525a2dc7
> > > [ 1042.674681] RSP: 002b:7ffef97aa778 EFLAGS: 0246 ORIG_RAX: 0010
> > > [ 1042.675664] RAX: ffda RBX: 000112bc RCX: 7f71525a2dc7
> > > [ 1042.676592] RDX: 7ffef97aa7a0 RSI: 40305825 RDI: 0003
> > > [ 1042.677520] RBP: 0009 R08: 0045 R09: 7ffef97aa78c
> > > [ 1042.678442] R10:  R11: 0246 R12: 0003
> > > [ 1042.679330] R13: 00019e38 R14: 000fcca7 R15: 0016
> > > [ 1042.680216] Code: 48 8d 5d e0 4c 8d 62 20 48 89 cf 48 29 d7 48 89 de 48 83 e6 e0 4c 01 e6 48 8d 04 17 4c 8b 02 4c 8b 4a 08 4c 8b 52 10 4c 8b 5a 18 <4c> 0f c3 00 4c 0f c3 48 08 4c 0f c3 50 10 4c 0f c3 58 18 48 83
> > > 
> > > This appears to be independent of both the guest kernel versio

Re: [PATCH v3 2/4] KVM: X86: Fix loss of exception which has not yet injected

2018-01-08 Thread Haozhong Zhang
On 01/07/18 00:26 -0700, Ross Zwisler wrote:
> On Wed, Aug 23, 2017 at 10:21 PM, Wanpeng Li  wrote:
> > From: Wanpeng Li 
> >
> > vmx_complete_interrupts() assumes that the exception is always injected,
> > so it would be dropped by kvm_clear_exception_queue(). This patch separates
> > exception.pending from exception.injected: exception.injected represents that
> > the exception is injected or should be reinjected because a vmexit occurred
> > during event delivery in VMX non-root operation; exception.pending represents
> > that the exception is queued and will be cleared when injecting the exception
> > to the guest. So exception.pending and exception.injected can cooperate to
> > guarantee the exception will not be lost.
> >
> > Reported-by: Radim Krčmář 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Signed-off-by: Wanpeng Li 
> > ---
> 
> I'm seeing a regression in my QEMU based NVDIMM testing system, and I
> bisected it to this commit.
> 
> The behavior I'm seeing is that heavy I/O to simulated NVDIMMs in
> multiple virtual machines causes the QEMU guests to receive double
> faults, crashing them.  Here's an example backtrace:
> 
> [ 1042.653816] PANIC: double fault, error_code: 0x0
> [ 1042.654398] CPU: 2 PID: 30257 Comm: fsstress Not tainted 4.15.0-rc5 #1
> [ 1042.655169] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> [ 1042.656121] RIP: 0010:memcpy_flushcache+0x4d/0x180
> [ 1042.656631] RSP: 0018:ac098c7d3808 EFLAGS: 00010286
> [ 1042.657245] RAX: ac0d18ca8000 RBX: 0fe0 RCX: ac0d18ca8000
> [ 1042.658085] RDX: 921aaa5df000 RSI: 921aaa5e RDI: 19f26e6c9000
> [ 1042.658802] RBP: 1000 R08:  R09: 
> [ 1042.659503] R10:  R11:  R12: 921aaa5df020
> [ 1042.660306] R13: ac0d18ca8000 R14: f4c102a977c0 R15: 1000
> [ 1042.661132] FS:  7f71530b90c0() GS:921b3b28() knlGS:
> [ 1042.662051] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1042.662528] CR2: 01156002 CR3: 00012a936000 CR4: 06e0
> [ 1042.663093] Call Trace:
> [ 1042.663329]  write_pmem+0x6c/0xa0 [nd_pmem]
> [ 1042.663668]  pmem_do_bvec+0x15f/0x330 [nd_pmem]
> [ 1042.664056]  ? kmem_alloc+0x61/0xe0 [xfs]
> [ 1042.664393]  pmem_make_request+0xdd/0x220 [nd_pmem]
> [ 1042.664781]  generic_make_request+0x11f/0x300
> [ 1042.665135]  ? submit_bio+0x6c/0x140
> [ 1042.665436]  submit_bio+0x6c/0x140
> [ 1042.665754]  ? next_bio+0x18/0x40
> [ 1042.666025]  ? _cond_resched+0x15/0x40
> [ 1042.666341]  submit_bio_wait+0x53/0x80
> [ 1042.666804]  blkdev_issue_zeroout+0xdc/0x210
> [ 1042.667336]  ? __dax_zero_page_range+0xb5/0x140
> [ 1042.667810]  __dax_zero_page_range+0xb5/0x140
> [ 1042.668197]  ? xfs_file_iomap_begin+0x2bd/0x8e0 [xfs]
> [ 1042.668611]  iomap_zero_range_actor+0x7c/0x1b0
> [ 1042.668974]  ? iomap_write_actor+0x170/0x170
> [ 1042.669318]  iomap_apply+0xa4/0x110
> [ 1042.669616]  ? iomap_write_actor+0x170/0x170
> [ 1042.669958]  iomap_zero_range+0x52/0x80
> [ 1042.670255]  ? iomap_write_actor+0x170/0x170
> [ 1042.670616]  xfs_setattr_size+0xd4/0x330 [xfs]
> [ 1042.670995]  xfs_ioc_space+0x27e/0x2f0 [xfs]
> [ 1042.671332]  ? terminate_walk+0x87/0xf0
> [ 1042.671662]  xfs_file_ioctl+0x862/0xa40 [xfs]
> [ 1042.672035]  ? _copy_to_user+0x22/0x30
> [ 1042.672346]  ? cp_new_stat+0x150/0x180
> [ 1042.672663]  do_vfs_ioctl+0xa1/0x610
> [ 1042.672960]  ? SYSC_newfstat+0x3c/0x60
> [ 1042.673264]  SyS_ioctl+0x74/0x80
> [ 1042.673661]  entry_SYSCALL_64_fastpath+0x1a/0x7d
> [ 1042.674239] RIP: 0033:0x7f71525a2dc7
> [ 1042.674681] RSP: 002b:7ffef97aa778 EFLAGS: 0246 ORIG_RAX: 0010
> [ 1042.675664] RAX: ffda RBX: 000112bc RCX: 7f71525a2dc7
> [ 1042.676592] RDX: 7ffef97aa7a0 RSI: 40305825 RDI: 0003
> [ 1042.677520] RBP: 0009 R08: 0045 R09: 7ffef97aa78c
> [ 1042.678442] R10:  R11: 0246 R12: 0003
> [ 1042.679330] R13: 00019e38 R14: 000fcca7 R15: 0016
> [ 1042.680216] Code: 48 8d 5d e0 4c 8d 62 20 48 89 cf 48 29 d7 48 89 de 48 83 e6 e0 4c 01 e6 48 8d 04 17 4c 8b 02 4c 8b 4a 08 4c 8b 52 10 4c 8b 5a 18 <4c> 0f c3 00 4c 0f c3 48 08 4c 0f c3 50 10 4c 0f c3 58 18 48 83
> 
> This appears to be independent of both the guest kernel version (this
> backtrace has v4.15.0-rc5, but I've seen it with other kernels) as
> well as independent of the host QEMU version (mine happens to be
> qemu-2.10.1-2.fc27 in Fedora 27).
> 
> The new behavior is due to this commit being present in the host OS
> kernel.  Prior to this commit I could fire up 4 VMs and run xfstests
> on my simulated NVDIMMs, but after this commit such testing results in
> multiple of my VMs crashing almost immediately.
> 
> Reproduction is very simple, at least on my development 

[PATCH v6 0/2] KVM: MMU: fix kvm_is_mmio_pfn()

2017-12-19 Thread Haozhong Zhang
Some reserved pages, such as those from NVDIMM DAX devices, are not
for MMIO, and can be mapped with cached memory type for better
performance. However, the existing check in kvm_is_mmio_pfn()
mistakes those pages for MMIO.  Because KVM maps MMIO pages with UC
memory type, the performance of guest accesses to those pages would
be harmed. Therefore, we check the host memory type in addition and
only treat UC/UC-/WC pages as MMIO.

Changes in v6:
 * Rename the function in patch 1 to pat_immune_to_uc_mtrr().
 * Consider WC memory type in patch 1.

Changes in v5:
 * Rename pat_pfn_is_uc() into pat_pfn_is_uc_or_uc_minus() to avoid
   confusion.
 * Drop converters between kvm_pfn_t and pfn_t, because they are not
   necessary. pat_pfn_is_uc_or_uc_minus() does not need flags in
   pfn_t, so we can only pass a raw unsigned long to it.

Changes in v4:
 * Mask pfn_t and kvm_pfn_t specific flags in conversion.

Changes in v3:
 * Move cache mode check to pat.c as pat_pfn_is_uc()
 * Reintroduce converters between kvm_pfn_t and pfn_t.

Changes in v2:
 * Switch to lookup_memtype() to get host memory type.
 * Rewrite the comment in KVM MMU patch.
 * Remove v1 patch 2, which is not necessary in v2.

Haozhong Zhang (2):
  x86/mm: add a function to check if a pfn is UC/UC-/WC
  KVM: MMU: consider host cache mode in MMIO page check

 arch/x86/include/asm/pat.h |  2 ++
 arch/x86/kvm/mmu.c | 13 -
 arch/x86/mm/pat.c  | 19 +++
 3 files changed, 33 insertions(+), 1 deletion(-)

-- 
2.14.1



[PATCH v6 2/2] KVM: MMU: consider host cache mode in MMIO page check

2017-12-19 Thread Haozhong Zhang
Some reserved pages, such as those from NVDIMM DAX devices, are not
for MMIO, and can be mapped with cached memory type for better
performance. However, the existing check in kvm_is_mmio_pfn()
mistakes those pages for MMIO.  Because KVM maps MMIO pages with UC
memory type, the performance of guest accesses to those pages would
be harmed. Therefore, we check the host memory type in addition and
only treat UC/UC-/WC pages as MMIO.

Signed-off-by: Haozhong Zhang 
Reported-by: Cuevas Escareno, Ivan D 
Reported-by: Kumar, Karthik 
Reviewed-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 89da688784fa..e3b9998b3355 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2708,7 +2708,18 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
+   /*
+* Some reserved pages, such as those from NVDIMM
+* DAX devices, are not for MMIO, and can be mapped
+* with cached memory type for better performance.
+* However, the above check misconceives those pages
+* as MMIO, and results in KVM mapping them with UC
+* memory type, which would hurt the performance.
+* Therefore, we check the host memory type in addition
+* and only treat UC/UC-/WC pages as MMIO.
+*/
+   (!pat_enabled() || pat_immune_to_uc_mtrr(pfn));
 
return true;
 }
-- 
2.14.1



[PATCH v6 1/2] x86/mm: add a function to check if a pfn is UC/UC-/WC

2017-12-19 Thread Haozhong Zhang
Check whether the PAT memory type of a pfn cannot be overridden by
MTRR UC memory type, i.e. the PAT memory type is UC, UC- or WC. This
function will be used by KVM to determine whether it needs to map a
host pfn to guest with UC memory type.

Signed-off-by: Haozhong Zhang 
Reviewed-by: Xiao Guangrong 
---
 arch/x86/include/asm/pat.h |  2 ++
 arch/x86/mm/pat.c  | 19 +++
 2 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
index 8a3ee355b422..9a217a18523b 100644
--- a/arch/x86/include/asm/pat.h
+++ b/arch/x86/include/asm/pat.h
@@ -22,4 +22,6 @@ int io_reserve_memtype(resource_size_t start, resource_size_t end,
 
 void io_free_memtype(resource_size_t start, resource_size_t end);
 
+bool pat_immune_to_uc_mtrr(unsigned long pfn);
+
 #endif /* _ASM_X86_PAT_H */
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index fe7d57a8fb60..2231a84c3d34 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -677,6 +677,25 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
return rettype;
 }
 
+/**
+ * Check whether the PAT memory type of @pfn cannot be overridden by
+ * UC MTRR memory type.
+ *
+ * Only to be called when PAT is enabled.
+ *
+ * Returns true, if the PAT memory type of @pfn is UC, UC-, or WC.
+ * Returns false in other cases.
+ */
+bool pat_immune_to_uc_mtrr(unsigned long pfn)
+{
+   enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
+
+   return cm == _PAGE_CACHE_MODE_UC ||
+  cm == _PAGE_CACHE_MODE_UC_MINUS ||
+  cm == _PAGE_CACHE_MODE_WC;
+}
+EXPORT_SYMBOL_GPL(pat_immune_to_uc_mtrr);
+
 /**
  * io_reserve_memtype - Request a memory type mapping for a region of memory
  * @start: start (physical address) of the region
-- 
2.14.1



Re: [PATCH v5 1/2] x86/mm: add a function to check if a pfn is UC/UC-

2017-12-18 Thread Haozhong Zhang
On 12/18/17 13:55 +0100, Paolo Bonzini wrote:
> On 08/11/2017 08:56, Haozhong Zhang wrote:
> > +bool pat_pfn_is_uc_or_uc_minus(unsigned long pfn)
> > +{
> > +   enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> > +
> > +   return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
> > +}
> > +EXPORT_SYMBOL_GPL(pat_pfn_is_uc_or_uc_minus);
> > +
> 
> As discussed in the reply to v2, this should include WC too.  The
> function name could become something like pat_pfn_downgraded_by_uc_mtrr.

Or shall we just expose lookup_memtype(), and keep all other logic in
KVM? The function name still looks strange somehow, and the check of
memory type makes more sense and would be easier to understand in the
context of KVM.
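
For illustration, if lookup_memtype() were exported, the KVM-side check
might look like this hypothetical helper (the name and placement are
illustrative, not from any posted patch):

    /* hypothetical: true if the host maps @pfn as UC, UC- or WC */
    static bool kvm_host_pfn_is_uncached(kvm_pfn_t pfn)
    {
            enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));

            return cm == _PAGE_CACHE_MODE_UC ||
                   cm == _PAGE_CACHE_MODE_UC_MINUS ||
                   cm == _PAGE_CACHE_MODE_WC;
    }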

Haozhong


Re: [PATCH v5 1/2] x86/mm: add a function to check if a pfn is UC/UC-

2017-11-15 Thread Haozhong Zhang
On 11/15/17 11:44 +0100, David Hildenbrand wrote:
> On 08.11.2017 08:56, Haozhong Zhang wrote:
> > It will be used by KVM to check whether a pfn should be
> > mapped to guest as UC.
> > 
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  arch/x86/include/asm/pat.h |  2 ++
> >  arch/x86/mm/pat.c  | 16 
> >  2 files changed, 18 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
> > index fffb2794dd89..fabb0cf00e77 100644
> > --- a/arch/x86/include/asm/pat.h
> > +++ b/arch/x86/include/asm/pat.h
> > @@ -21,4 +21,6 @@ int io_reserve_memtype(resource_size_t start, resource_size_t end,
> >  
> >  void io_free_memtype(resource_size_t start, resource_size_t end);
> >  
> > +bool pat_pfn_is_uc_or_uc_minus(unsigned long pfn);
> > +
> >  #endif /* _ASM_X86_PAT_H */
> > diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
> > index fe7d57a8fb60..e1282dd4eeb8 100644
> > --- a/arch/x86/mm/pat.c
> > +++ b/arch/x86/mm/pat.c
> > @@ -677,6 +677,22 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
> > return rettype;
> >  }
> >  
> > +/**
> > + * Check with PAT whether the memory type of a pfn is UC or UC-.
> > + *
> > + * Only to be called when PAT is enabled.
> > + *
> > + * Returns true, if the memory type of @pfn is UC or UC-.
> > + * Otherwise, returns false.
> > + */
> > +bool pat_pfn_is_uc_or_uc_minus(unsigned long pfn)
> > +{
> > +   enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> > +
> > +   return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
> > +}
> > +EXPORT_SYMBOL_GPL(pat_pfn_is_uc_or_uc_minus);
> > +
> >  /**
> >   * io_reserve_memtype - Request a memory type mapping for a region of memory
> >   * @start: start (physical address) of the region
> > 
> 
> Wonder if we should check for pat internally. And if we should simply
> return the memtype via lookup_memtype() instead of creating such a
> strange named function (by providing e.g. a lookup_memtype() variant
> that can be called with !pat_enabled()).
>
> The caller can easily check against _PAGE_CACHE_MODE_UC ...
>

Yes, the better solution should work for both PAT enabled and disabled
cases, like what __vm_insert_mixed() does: use vma->vm_page_prot if
PAT is disabled, and refer to track_pfn_insert() in addition if PAT is
enabled.

The early RFC patch [1] got the cache mode in a similar way via a new
function kvm_vcpu_gfn_to_pgprot(). However, as explained in RFC, it
does not work, because the existing MMIO check (where kvm_vcpu_gfn_to_pgprot()
is called) in KVM is performed with a spinlock (vcpu->kvm->mmu_lock)
being taken, but kvm_vcpu_gfn_to_pgprot() has to touch a semaphore
(vcpu->kvm->mm->mmap_sem). Besides, KVM may prefetch and check MMIO of
other pfns within vcpu->kvm->mmu_lock, and the prefetched pfns cannot
be predicted in advance, which means we have to keep the MMIO check
within vcpu->kvm->mmu_lock.
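
Schematically, the problematic nesting would be (a sketch of the
constraint, not actual call sites):

    spin_lock(&vcpu->kvm->mmu_lock);                /* MMIO check runs under this spinlock */
    ...
    kvm_vcpu_gfn_to_pgprot(vcpu, gfn);              /* the RFC approach */
            down_read(&vcpu->kvm->mm->mmap_sem);    /* sleeping lock: not allowed under a spinlock */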

Therefore, I only make a suboptimal fix in this patchset that only
fixes PAT enabled cases, which I suppose is the usual usage scenario
of NVDIMM.


[1] https://patchwork.kernel.org/patch/10016261/


Haozhong


Re: [PATCH v5 1/2] x86/mm: add a function to check if a pfn is UC/UC-

2017-11-15 Thread Haozhong Zhang
On 11/15/17 07:17 -0800, Dan Williams wrote:
> On Tue, Nov 7, 2017 at 11:56 PM, Haozhong Zhang  wrote:
> > It will be used by KVM to check whether a pfn should be
> > mapped to guest as UC.
> >
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  arch/x86/include/asm/pat.h |  2 ++
> >  arch/x86/mm/pat.c  | 16 
> >  2 files changed, 18 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
> > index fffb2794dd89..fabb0cf00e77 100644
> > --- a/arch/x86/include/asm/pat.h
> > +++ b/arch/x86/include/asm/pat.h
> > @@ -21,4 +21,6 @@ int io_reserve_memtype(resource_size_t start, resource_size_t end,
> >
> >  void io_free_memtype(resource_size_t start, resource_size_t end);
> >
> > +bool pat_pfn_is_uc_or_uc_minus(unsigned long pfn);
> > +
> >  #endif /* _ASM_X86_PAT_H */
> > diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
> > index fe7d57a8fb60..e1282dd4eeb8 100644
> > --- a/arch/x86/mm/pat.c
> > +++ b/arch/x86/mm/pat.c
> > @@ -677,6 +677,22 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
> > return rettype;
> >  }
> >
> > +/**
> > + * Check with PAT whether the memory type of a pfn is UC or UC-.
> > + *
> > + * Only to be called when PAT is enabled.
> > + *
> > + * Returns true, if the memory type of @pfn is UC or UC-.
> > + * Otherwise, returns false.
> > + */
> > +bool pat_pfn_is_uc_or_uc_minus(unsigned long pfn)
> > +{
> > +   enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> > +
> > +   return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
> > +}
> > +EXPORT_SYMBOL_GPL(pat_pfn_is_uc_or_uc_minus);
> 
> Why do we need this strangely named new accessor? It seems to be
> open-coding a new / more limited version of track_pfn_insert().

In the first version patchset, KVM did extract and check the cache
mode got from track_pfn_insert(), but Ingo thought it was better to
keep all the low-level details out of KVM, so I encapsulated them in a
function in mm in subsequent versions.

Haozhong


[PATCH v5 2/2] KVM: MMU: consider host cache mode in MMIO page check

2017-11-07 Thread Haozhong Zhang
Some reserved pages, such as those from NVDIMM DAX devices, are not
for MMIO, and can be mapped with cached memory type for better
performance. However, the existing check in kvm_is_mmio_pfn()
mistakes those pages for MMIO.  Because KVM maps MMIO pages with UC
memory type, the performance of guest accesses to those pages would
be harmed. Therefore, we check the host memory type in addition and
only treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang 
Reported-by: Cuevas Escareno, Ivan D 
Reported-by: Kumar, Karthik 
---
 arch/x86/kvm/mmu.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..7715476bc5c9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2708,7 +2708,20 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
+   /*
+* Some reserved pages, such as those from
+* NVDIMM DAX devices, are not for MMIO, and
+* can be mapped with cached memory type for
+* better performance. However, the above
+* check misconceives those pages as MMIO.
+* Because KVM maps MMIO pages with UC memory
+* type, the performance of guest accesses to
+* those pages would be harmed. Therefore, we
+* check the host memory type in addition and
+* only treat UC/UC- pages as MMIO.
+*/
+   (!pat_enabled() || pat_pfn_is_uc_or_uc_minus(pfn));
 
return true;
 }
-- 
2.14.1



[PATCH v5 1/2] x86/mm: add a function to check if a pfn is UC/UC-

2017-11-07 Thread Haozhong Zhang
It will be used by KVM to check whether a pfn should be
mapped to guest as UC.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/include/asm/pat.h |  2 ++
 arch/x86/mm/pat.c  | 16 
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
index fffb2794dd89..fabb0cf00e77 100644
--- a/arch/x86/include/asm/pat.h
+++ b/arch/x86/include/asm/pat.h
@@ -21,4 +21,6 @@ int io_reserve_memtype(resource_size_t start, resource_size_t end,
 
 void io_free_memtype(resource_size_t start, resource_size_t end);
 
+bool pat_pfn_is_uc_or_uc_minus(unsigned long pfn);
+
 #endif /* _ASM_X86_PAT_H */
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index fe7d57a8fb60..e1282dd4eeb8 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -677,6 +677,22 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
return rettype;
 }
 
+/**
+ * Check with PAT whether the memory type of a pfn is UC or UC-.
+ *
+ * Only to be called when PAT is enabled.
+ *
+ * Returns true, if the memory type of @pfn is UC or UC-.
+ * Otherwise, returns false.
+ */
+bool pat_pfn_is_uc_or_uc_minus(unsigned long pfn)
+{
+   enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
+
+   return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
+}
+EXPORT_SYMBOL_GPL(pat_pfn_is_uc_or_uc_minus);
+
 /**
  * io_reserve_memtype - Request a memory type mapping for a region of memory
  * @start: start (physical address) of the region
-- 
2.14.1



[PATCH v5 0/2] KVM: MMU: fix kvm_is_mmio_pfn()

2017-11-07 Thread Haozhong Zhang
Some reserved pages, such as those from NVDIMM DAX devices, are not
for MMIO, and can be mapped with cached memory type for better
performance. However, the existing check in kvm_is_mmio_pfn()
mistakes those pages for MMIO.  Because KVM maps MMIO pages with UC
memory type, the performance of guest accesses to those pages would
be harmed. Therefore, we check the host memory type in addition and
only treat UC/UC- pages as MMIO.

Changes in v5:
 * Rename pat_pfn_is_uc() into pat_pfn_is_uc_or_uc_minus() to avoid
   confusion.
 * Drop converters between kvm_pfn_t and pfn_t, because they are not
   necessary. pat_pfn_is_uc_or_uc_minus() does not need flags in
   pfn_t, so we can only pass a raw unsigned long to it.

Changes in v4:
 * Mask pfn_t and kvm_pfn_t specific flags in conversion.

Changes in v3:
 * Move cache mode check to pat.c as pat_pfn_is_uc()
 * Reintroduce converters between kvm_pfn_t and pfn_t.

Changes in v2:
 * Switch to lookup_memtype() to get host memory type.
 * Rewrite the comment in KVM MMU patch.
 * Remove v1 patch 2, which is not necessary in v2.


Haozhong Zhang (2):
  x86/mm: add functions to check if a pfn is UC/UC-
  KVM: MMU: consider host cache mode in MMIO page check

 arch/x86/include/asm/pat.h |  2 ++
 arch/x86/kvm/mmu.c | 15 ++-
 arch/x86/mm/pat.c  | 16 
 3 files changed, 32 insertions(+), 1 deletion(-)

-- 
2.14.1



Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-07 Thread Haozhong Zhang
On 11/03/17 16:51 +0800, Haozhong Zhang wrote:
> On 11/03/17 14:54 +0800, Xiao Guangrong wrote:
> > 
> > 
> > On 11/03/2017 01:53 PM, Haozhong Zhang wrote:
> > > Some reserved pages, such as those from NVDIMM DAX devices, are
> > > not for MMIO, and can be mapped with cached memory type for better
> > > performance. However, the above check misconceives those pages as
> > > MMIO.  Because KVM maps MMIO pages with UC memory type, the
> > > performance of guest accesses to those pages would be harmed.
> > > Therefore, we check the host memory type by lookup_memtype() in
> > > addition and only treat UC/UC- pages as MMIO.
> > > 
> > > Signed-off-by: Haozhong Zhang 
> > > Reported-by: Cuevas Escareno, Ivan D 
> > > Reported-by: Kumar, Karthik 
> > > ---
> > >   arch/x86/kvm/mmu.c | 19 ++-
> > >   1 file changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index 0b481cc9c725..e9ed0e666a83 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> > >   static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> > >   {
> > >   if (pfn_valid(pfn))
> > > - return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> > > + return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
> > > + /*
> > > +  * Some reserved pages, such as those from
> > > +  * NVDIMM DAX devices, are not for MMIO, and
> > > +  * can be mapped with cached memory type for
> > > +  * better performance. However, the above
> > > +  * check misconceives those pages as MMIO.
> > > +  * Because KVM maps MMIO pages with UC memory
> > > +  * type, the performance of guest accesses to
> > > +  * those pages would be harmed. Therefore, we
> > > +  * check the host memory type in addition and
> > > +  * only treat UC/UC- pages as MMIO.
> > > +  *
> > > +  * pat_pfn_is_uc() works only when PAT is enabled,
> > > +  * so check pat_enabled() as well.
> > > +  */
> > > + (!pat_enabled() ||
> > > +  pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));
> > 
> > Can it be compiled if !CONFIG_PAT?
> 
> Yes.
> 
> What I check via pat_enabled() is not only whether PAT support is
> compiled, but also whether PAT is enabled at runtime.
> 
> > 
> > It would be better if we move pat_enabled out of kvm as well,
> 
> Surely I can combine them in one function like
> 
> bool pat_pfn_is_uc(pfn_t pfn)
> {
>   enum page_cache_mode cm;
> 
>   if (!pat_enabled())
>   return false;

I made a mistake: it should return true here. Then the semantics of
this function is confused. I think it's still better to leave
!pat_enabled() check in KVM.
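
The shape that survives into the v5/v6 patches (earlier in this digest)
therefore keeps that check on the KVM side, roughly:

    if (pfn_valid(pfn))
            return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
                    (!pat_enabled() || pat_pfn_is_uc_or_uc_minus(pfn));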

Haozhong

> 
>   cm = lookup_memtype(pfn_t_to_phys(pfn));
> 
>   return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
> }
> 
> but I need a good name to make its semantics clear, or is it enough to
> just leave a comment like?
> 
> /*
>  * Check via PAT whether the cache mode of a page is UC or UC-.
>  *
>  * Returns true, if PAT is enabled and the cache mode is UC or UC-.
>  * Returns false otherwise.
>  */
> 
> 
> > please refer
> > to pgprot_writecombine() which is implemented in pat.c and in
> > include\asm-generic\pgtable.h:
> > 
> > #ifndef pgprot_writecombine
> > #define pgprot_writecombine pgprot_noncached
> > #endif
> >
> 
> 
> 


Re: [PATCH v4 2/3] KVM: add converters between pfn_t and kvm_pfn_t

2017-11-07 Thread Haozhong Zhang
On 11/03/17 06:21 -0700, Dan Williams wrote:
> On Thu, Nov 2, 2017 at 10:53 PM, Haozhong Zhang  wrote:
> > Signed-off-by: Haozhong Zhang 
> 
> Can you also add some text to the changelog saying why we need these 
> converters?

I'm going to drop these converters in the next version. The function
introduced in patch 1 does not need flags in pfn_t, so unsigned long
is enough. I'll change pfn_t to unsigned long in patch 1 as well.
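
That is, roughly this interface change in patch 1, as it then appears in
the v5 posting above:

    -bool pat_pfn_is_uc(pfn_t pfn);
    +bool pat_pfn_is_uc_or_uc_minus(unsigned long pfn);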

Haozhong


Re: [PATCH v2 0/2] KVM: MMU: fix kvm_is_mmio_pfn()

2017-11-07 Thread Haozhong Zhang
On 11/03/17 10:15 -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 31 Oct 2017, Haozhong Zhang wrote:
> 
> > [I just copy the commit message from patch 2]
> > 
> > Some reserved pages, such as those from NVDIMM DAX devices, are
> > not for MMIO, and can be mapped with cached memory type for better
> > performance. However, the above check misconceives those pages as
> 
> Note that cached memory type on persistent memory has horrible 
> performance. The clwb instruction on Broadwell is very slow - when you 
> write to persistent memory and use clwb to flush cache, the performance is 
> about 350MB/s.

Wasn't clwb first introduced on Skylake?

Haozhong

> 
> Using write-combining memory type for persistent memory is much faster, it 
> can sustain performance of one 8-byte write per tick.
> 
> Mikulas
> 
> > MMIO.  Because KVM maps MMIO pages with UC memory type, the
> > performance of guest accesses to those pages would be harmed.
> > Therefore, we check the host memory type by lookup_memtype() in
> > addition and only treat UC/UC- pages as MMIO.
> > 
> > Changes in v2:
> >  * Switch to lookup_memtype() to get host memory type.
> >  * Rewrite the comment in KVM MMU patch.
> >  * Remove v1 patch 2, which is not necessary in v2.
> > 
> > Haozhong Zhang (2):
> >   x86/mm: expose lookup_memtype()
> >   KVM: MMU: consider host cache mode in MMIO page check
> > 
> >  arch/x86/include/asm/pat.h |  2 ++
> >  arch/x86/kvm/mmu.c | 30 +++---
> >  arch/x86/mm/pat.c  |  3 ++-
> >  3 files changed, 31 insertions(+), 4 deletions(-)
> > 
> > -- 
> > 2.14.1
> > 


Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-03 Thread Haozhong Zhang
On 11/03/17 17:24 +0800, Xiao Guangrong wrote:
> 
> 
> On 11/03/2017 05:02 PM, Haozhong Zhang wrote:
> > On 11/03/17 16:51 +0800, Haozhong Zhang wrote:
> > > On 11/03/17 14:54 +0800, Xiao Guangrong wrote:
> > > > 
> > > > 
> > > > On 11/03/2017 01:53 PM, Haozhong Zhang wrote:
> > > > > Some reserved pages, such as those from NVDIMM DAX devices, are
> > > > > not for MMIO, and can be mapped with cached memory type for better
> > > > > performance. However, the above check misconceives those pages as
> > > > > MMIO.  Because KVM maps MMIO pages with UC memory type, the
> > > > > performance of guest accesses to those pages would be harmed.
> > > > > Therefore, we check the host memory type by lookup_memtype() in
> > > > > addition and only treat UC/UC- pages as MMIO.
> > > > > 
> > > > > Signed-off-by: Haozhong Zhang 
> > > > > Reported-by: Cuevas Escareno, Ivan D 
> > > > > 
> > > > > Reported-by: Kumar, Karthik 
> > > > > ---
> > > > >arch/x86/kvm/mmu.c | 19 ++-
> > > > >1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > > index 0b481cc9c725..e9ed0e666a83 100644
> > > > > --- a/arch/x86/kvm/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu.c
> > > > > @@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> > > > >static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> > > > >{
> > > > >   if (pfn_valid(pfn))
> > > > > - return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> > > > > + return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
> > > > > + /*
> > > > > +  * Some reserved pages, such as those from
> > > > > +  * NVDIMM DAX devices, are not for MMIO, and
> > > > > +  * can be mapped with cached memory type for
> > > > > +  * better performance. However, the above
> > > > > +  * check misconceives those pages as MMIO.
> > > > > +  * Because KVM maps MMIO pages with UC memory
> > > > > +  * type, the performance of guest accesses to
> > > > > +  * those pages would be harmed. Therefore, we
> > > > > +  * check the host memory type in addition and
> > > > > +  * only treat UC/UC- pages as MMIO.
> > > > > +  *
> > > > > +  * pat_pfn_is_uc() works only when PAT is enabled,
> > > > > +  * so check pat_enabled() as well.
> > > > > +  */
> > > > > + (!pat_enabled() ||
> > > > > +  pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));
> > > > 
> > > > Can it be compiled if !CONFIG_PAT?
> > > 
> > > Yes.
> > > 
> > > What I check via pat_enabled() is not only whether PAT support is
> > > compiled, but also whether PAT is enabled at runtime.
> > > 
> > > > 
> > > > It would be better if we move pat_enabled out of kvm as well,
> > > 
> > > Surely I can combine them in one function like
> > > 
> > > bool pat_pfn_is_uc(pfn_t pfn)
> > > {
> > >   enum page_cache_mode cm;
> > > 
> > >   if (!pat_enabled())
> > >   return false;
> > > 
> > >   cm = lookup_memtype(pfn_t_to_phys(pfn));
> > > 
> > >   return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
> > > }
> > 
> > In addition, I think it's better to split this function into
> > pat_pfn_is_uc() and pat_pfn_is_uc_minus() to avoid additional
> > confusion.
> 
> Why not use pat_pfn_is_uc_or_uc_minus(). :)

Just in case places other than KVM need only one of them.


Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-03 Thread Haozhong Zhang
On 11/03/17 17:10 +0800, Xiao Guangrong wrote:
> 
> 
> On 11/03/2017 04:51 PM, Haozhong Zhang wrote:
> > On 11/03/17 14:54 +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 11/03/2017 01:53 PM, Haozhong Zhang wrote:
> > > > Some reserved pages, such as those from NVDIMM DAX devices, are
> > > > not for MMIO, and can be mapped with cached memory type for better
> > > > performance. However, the above check misconceives those pages as
> > > > MMIO.  Because KVM maps MMIO pages with UC memory type, the
> > > > performance of guest accesses to those pages would be harmed.
> > > > Therefore, we check the host memory type by lookup_memtype() in
> > > > addition and only treat UC/UC- pages as MMIO.
> > > > 
> > > > Signed-off-by: Haozhong Zhang 
> > > > Reported-by: Cuevas Escareno, Ivan D 
> > > > Reported-by: Kumar, Karthik 
> > > > ---
> > > >arch/x86/kvm/mmu.c | 19 ++-
> > > >1 file changed, 18 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > index 0b481cc9c725..e9ed0e666a83 100644
> > > > --- a/arch/x86/kvm/mmu.c
> > > > +++ b/arch/x86/kvm/mmu.c
> > > > @@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> > > >static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> > > >{
> > > > if (pfn_valid(pfn))
> > > > -   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> > > > +   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
> > > > +   /*
> > > > +* Some reserved pages, such as those from
> > > > +* NVDIMM DAX devices, are not for MMIO, and
> > > > +* can be mapped with cached memory type for
> > > > +* better performance. However, the above
> > > > +* check misconceives those pages as MMIO.
> > > > +* Because KVM maps MMIO pages with UC memory
> > > > +* type, the performance of guest accesses to
> > > > +* those pages would be harmed. Therefore, we
> > > > +* check the host memory type in addition and
> > > > +* only treat UC/UC- pages as MMIO.
> > > > +*
> > > > +* pat_pfn_is_uc() works only when PAT is enabled,
> > > > +* so check pat_enabled() as well.
> > > > +*/
> > > > +   (!pat_enabled() ||
> > > > +pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));
> > > 
> > > Can it be compiled if !CONFIG_PAT?
> > 
> > Yes.
> > 
> > What I check via pat_enabled() is not only whether PAT support is
> > compiled, but also whether PAT is enabled at runtime.
> 
> The issue is about pat_pfn_is_uc() which is implemented only if CONFIG_PAT is
> enabled, but you used it here unconditionally.
> 
> I am not sure if gcc is smart enough to omit pat_pfn_is_uc() completely under
> this case. If you really have done the test to compile kernel and KVM module
> with CONFIG_PAT disabled, it is fine.
> 

I've done the test and it can compile.

arch/x86/mm/Makefile shows pat.c is compiled regardless of CONFIG_X86_PAT,
and pat_pfn_is_uc() is defined outside the #ifdef CONFIG_X86_PAT ... #endif block.
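
The relevant build rules, quoted from memory of that era's
arch/x86/mm/Makefile and therefore approximate: pat.o sits in the
unconditional obj-y list, while only the rbtree helper is gated:

    obj-y                   := ... pat.o pgtable.o ...
    obj-$(CONFIG_X86_PAT)   += pat_rbtree.o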

Haozhong


Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-03 Thread Haozhong Zhang
On 11/03/17 16:51 +0800, Haozhong Zhang wrote:
> On 11/03/17 14:54 +0800, Xiao Guangrong wrote:
> > 
> > 
> > On 11/03/2017 01:53 PM, Haozhong Zhang wrote:
> > > Some reserved pages, such as those from NVDIMM DAX devices, are
> > > not for MMIO, and can be mapped with cached memory type for better
> > > performance. However, the above check misconceives those pages as
> > > MMIO.  Because KVM maps MMIO pages with UC memory type, the
> > > performance of guest accesses to those pages would be harmed.
> > > Therefore, we check the host memory type by lookup_memtype() in
> > > addition and only treat UC/UC- pages as MMIO.
> > > 
> > > Signed-off-by: Haozhong Zhang 
> > > Reported-by: Cuevas Escareno, Ivan D 
> > > Reported-by: Kumar, Karthik 
> > > ---
> > >   arch/x86/kvm/mmu.c | 19 ++-
> > >   1 file changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index 0b481cc9c725..e9ed0e666a83 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> > >   static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> > >   {
> > >   if (pfn_valid(pfn))
> > > - return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> > > + return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
> > > + /*
> > > +  * Some reserved pages, such as those from
> > > +  * NVDIMM DAX devices, are not for MMIO, and
> > > +  * can be mapped with cached memory type for
> > > +  * better performance. However, the above
> > > +  * check misconceives those pages as MMIO.
> > > +  * Because KVM maps MMIO pages with UC memory
> > > +  * type, the performance of guest accesses to
> > > +  * those pages would be harmed. Therefore, we
> > > +  * check the host memory type in addition and
> > > +  * only treat UC/UC- pages as MMIO.
> > > +  *
> > > +  * pat_pfn_is_uc() works only when PAT is enabled,
> > > +  * so check pat_enabled() as well.
> > > +  */
> > > + (!pat_enabled() ||
> > > +  pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));
> > 
> > Can it be compiled if !CONFIG_PAT?
> 
> Yes.
> 
> What I check via pat_enabled() is not only whether PAT support is
> compiled, but also whether PAT is enabled at runtime.
> 
> > 
> > It would be better if we move pat_enabled out of kvm as well,
> 
> Surely I can combine them in one function like
> 
> bool pat_pfn_is_uc(pfn_t pfn)
> {
>   enum page_cache_mode cm;
> 
>   if (!pat_enabled())
>   return false;
> 
>   cm = lookup_memtype(pfn_t_to_phys(pfn));
> 
>   return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
> }

In addition, I think it's better to split this function into
pat_pfn_is_uc() and pat_pfn_is_uc_minus() to avoid additional
confusion.

Haozhong

> 
> but I need a good name to make its semantics clear, or is it enough to
> just leave a comment like?
> 
> /*
>  * Check via PAT whether the cache mode of a page is UC or UC-.
>  *
>  * Returns true, if PAT is enabled and the cache mode is UC or UC-.
>  * Returns false otherwise.
>  */
> 
> 
> > please refer
> > to pgprot_writecombine() which is implemented in pat.c and in
> > include\asm-generic\pgtable.h:
> > 
> > #ifndef pgprot_writecombine
> > #define pgprot_writecombine pgprot_noncached
> > #endif
> >
> 
> 
> 


Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-03 Thread Haozhong Zhang
On 11/03/17 14:54 +0800, Xiao Guangrong wrote:
> 
> 
> On 11/03/2017 01:53 PM, Haozhong Zhang wrote:
> > Some reserved pages, such as those from NVDIMM DAX devices, are
> > not for MMIO, and can be mapped with cached memory type for better
> > performance. However, the above check misconceives those pages as
> > MMIO.  Because KVM maps MMIO pages with UC memory type, the
> > performance of guest accesses to those pages would be harmed.
> > Therefore, we check the host memory type by lookup_memtype() in
> > addition and only treat UC/UC- pages as MMIO.
> > 
> > Signed-off-by: Haozhong Zhang 
> > Reported-by: Cuevas Escareno, Ivan D 
> > Reported-by: Kumar, Karthik 
> > ---
> >   arch/x86/kvm/mmu.c | 19 ++-
> >   1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 0b481cc9c725..e9ed0e666a83 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> >   static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> >   {
> > if (pfn_valid(pfn))
> > -   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> > +   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
> > +   /*
> > +* Some reserved pages, such as those from
> > +* NVDIMM DAX devices, are not for MMIO, and
> > +* can be mapped with cached memory type for
> > +* better performance. However, the above
> > +* check misconceives those pages as MMIO.
> > +* Because KVM maps MMIO pages with UC memory
> > +* type, the performance of guest accesses to
> > +* those pages would be harmed. Therefore, we
> > +* check the host memory type in addition and
> > +* only treat UC/UC- pages as MMIO.
> > +*
> > +* pat_pfn_is_uc() works only when PAT is enabled,
> > +* so check pat_enabled() as well.
> > +*/
> > +   (!pat_enabled() ||
> > +pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));
> 
> Can it be compiled if !CONFIG_PAT?

Yes.

What I check via pat_enabled() is not only whether PAT support is
compiled, but also whether PAT is enabled at runtime.

> 
> It would be better if we move pat_enabled out of kvm as well,

Surely I can combine them in one function like

bool pat_pfn_is_uc(pfn_t pfn)
{
enum page_cache_mode cm;

if (!pat_enabled())
return false;

cm = lookup_memtype(pfn_t_to_phys(pfn));

return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
}

but I need a good name to make its semantics clear, or is it enough to
just leave a comment like?

/*
 * Check via PAT whether the cache mode of a page is UC or UC-.
 *
 * Returns true, if PAT is enabled and the cache mode is UC or UC-.
 * Returns false otherwise.
 */


> please refer
> to pgprot_writecombine() which is implemented in pat.c and in
> include\asm-generic\pgtable.h:
> 
> #ifndef pgprot_writecombine
> #define pgprot_writecombine pgprot_noncached
> #endif
>
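
Following that pattern, a !CONFIG_X86_PAT fallback could look like
(sketch only, mirroring the pgprot_writecombine() trick; the x86 pat.h
would add "#define pat_pfn_is_uc pat_pfn_is_uc" next to its
declaration):

#ifndef pat_pfn_is_uc
static inline bool pat_pfn_is_uc(pfn_t pfn)
{
        /* Without PAT we cannot look up the memtype, so never claim UC. */
        return false;
}
#endif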





[PATCH v4 2/3] KVM: add converters between pfn_t and kvm_pfn_t

2017-11-02 Thread Haozhong Zhang
Signed-off-by: Haozhong Zhang 
---
 include/linux/kvm_host.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6882538eda32..caf6f7a6bdb2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -67,6 +68,9 @@
 #define KVM_PFN_ERR_HWPOISON   (KVM_PFN_ERR_MASK + 1)
 #define KVM_PFN_ERR_RO_FAULT   (KVM_PFN_ERR_MASK + 2)
 
+#define pfn_t_to_kvm_pfn_t(x)  pfn_t_to_pfn(x)
+#define kvm_pfn_t_to_pfn_t(x)  pfn_to_pfn_t((x) & ~KVM_PFN_ERR_NOSLOT_MASK)
+
 /*
  * error pfns indicate that the gfn is in slot but faild to
  * translate it to pfn on host.
-- 
2.14.1



[PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-02 Thread Haozhong Zhang
Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang 
Reported-by: Cuevas Escareno, Ivan D 
Reported-by: Kumar, Karthik 
---
 arch/x86/kvm/mmu.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..e9ed0e666a83 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
+   /*
+* Some reserved pages, such as those from
+* NVDIMM DAX devices, are not for MMIO, and
+* can be mapped with cached memory type for
+* better performance. However, the above
+* check misconceives those pages as MMIO.
+* Because KVM maps MMIO pages with UC memory
+* type, the performance of guest accesses to
+* those pages would be harmed. Therefore, we
+* check the host memory type in addition and
+* only treat UC/UC- pages as MMIO.
+*
+* pat_pfn_is_uc() works only when PAT is enabled,
+* so check pat_enabled() as well.
+*/
+   (!pat_enabled() ||
+pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));
 
return true;
 }
-- 
2.14.1



[PATCH v4 1/3] x86/mm: add function to check if a pfn is UC/UC-

2017-11-02 Thread Haozhong Zhang
pat_pfn_is_uc(pfn) is added and will be used by KVM to check whether
the memory type of a host pfn is UC/UC-.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/include/asm/pat.h |  2 ++
 arch/x86/mm/pat.c  | 14 ++
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
index fffb2794dd89..4027d9fb10a8 100644
--- a/arch/x86/include/asm/pat.h
+++ b/arch/x86/include/asm/pat.h
@@ -21,4 +21,6 @@ int io_reserve_memtype(resource_size_t start, resource_size_t end,
 
 void io_free_memtype(resource_size_t start, resource_size_t end);
 
+bool pat_pfn_is_uc(pfn_t pfn);
+
 #endif /* _ASM_X86_PAT_H */
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index fe7d57a8fb60..71fe9c29727e 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -677,6 +677,20 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
return rettype;
 }
 
+/**
+ * Check whether the memory type of a pfn is UC or UC-. The result is
+ * valid only when PAT is enabled.
+ *
+ * Returns true if it is; otherwise, returns false.
+ */
+bool pat_pfn_is_uc(pfn_t pfn)
+{
+   enum page_cache_mode cm = lookup_memtype(pfn_t_to_phys(pfn));
+
+   return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
+}
+EXPORT_SYMBOL_GPL(pat_pfn_is_uc);
+
 /**
  * io_reserve_memtype - Request a memory type mapping for a region of memory
  * @start: start (physical address) of the region
-- 
2.14.1



[PATCH v4 0/3] KVM: MMU: fix kvm_is_mmio_pfn()

2017-11-02 Thread Haozhong Zhang
Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.


Changes in v4:
 * Mask pfn_t and kvm_pfn_t specific flags in conversion.

Changes in v3:
 * Move cache mode check to pat.c as pat_pfn_is_uc()
 * Reintroduce converters between kvm_pfn_t and pfn_t.

Changes in v2:
 * Switch to lookup_memtype() to get host memory type.
 * Rewrite the comment in KVM MMU patch.
 * Remove v1 patch 2, which is not necessary in v2.


Haozhong Zhang (3):
  x86/mm: add function to check if a pfn is UC/UC-
  KVM: add converters between pfn_t and kvm_pfn_t
  KVM: MMU: consider host cache mode in MMIO page check

 arch/x86/include/asm/pat.h |  2 ++
 arch/x86/kvm/mmu.c | 19 ++-
 arch/x86/mm/pat.c  | 14 ++
 include/linux/kvm_host.h   |  4 
 4 files changed, 38 insertions(+), 1 deletion(-)

-- 
2.14.1



Re: [PATCH v3 2/3] KVM: add converters between pfn_t and kvm_pfn_t

2017-11-02 Thread Haozhong Zhang
On 11/02/17 19:25 -0700, Dan Williams wrote:
> On Thu, Nov 2, 2017 at 6:16 PM, Haozhong Zhang  
> wrote:
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  include/linux/kvm_host.h | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 6882538eda32..759fe498c89e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -67,6 +67,9 @@
> >  #define KVM_PFN_ERR_HWPOISON   (KVM_PFN_ERR_MASK + 1)
> >  #define KVM_PFN_ERR_RO_FAULT   (KVM_PFN_ERR_MASK + 2)
> >
> > +#define kvm_pfn_to_pfn(x) ((pfn_t){ .val = (x)})
> > +#define pfn_to_kvm_pfn(x) ((kvm_pfn_t)((x).val))
> 
> Shouldn't this mask off the kvm_pfn_t and  pfn_t specific bits when
> converting between the two?

Yes, I'll fix it.

Thanks,
Haozhong


[PATCH v3 0/3] KVM: MMU: fix kvm_is_mmio_pfn()

2017-11-02 Thread Haozhong Zhang
Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.


Changes in v3:
 * Move cache mode check to pat.c as pat_pfn_is_uc()
 * Reintroduce converters between kvm_pfn_t and pfn_t.

Changes in v2:
 * Switch to lookup_memtype() to get host memory type.
 * Rewrite the comment in KVM MMU patch.
 * Remove v1 patch 2, which is not necessary in v2.

Haozhong Zhang (3):
  x86/mm: add function to check if a pfn is UC/UC-
  KVM: add converters between pfn_t and kvm_pfn_t
  KVM: MMU: consider host cache mode in MMIO page check

 arch/x86/include/asm/pat.h |  2 ++
 arch/x86/kvm/mmu.c | 18 +-
 arch/x86/mm/pat.c  | 14 ++
 include/linux/kvm_host.h   |  3 +++
 4 files changed, 36 insertions(+), 1 deletion(-)

-- 
2.14.1



[PATCH v3 2/3] KVM: add converters between pfn_t and kvm_pfn_t

2017-11-02 Thread Haozhong Zhang
Signed-off-by: Haozhong Zhang 
---
 include/linux/kvm_host.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6882538eda32..759fe498c89e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -67,6 +67,9 @@
 #define KVM_PFN_ERR_HWPOISON   (KVM_PFN_ERR_MASK + 1)
 #define KVM_PFN_ERR_RO_FAULT   (KVM_PFN_ERR_MASK + 2)
 
+#define kvm_pfn_to_pfn(x) ((pfn_t){ .val = (x)})
+#define pfn_to_kvm_pfn(x) ((kvm_pfn_t)((x).val))
+
 /*
  * error pfns indicate that the gfn is in slot but faild to
  * translate it to pfn on host.
-- 
2.14.1



[PATCH v3 1/3] x86/mm: add function to check if a pfn is UC/UC-

2017-11-02 Thread Haozhong Zhang
pat_pfn_is_uc(pfn) is added and will be used by KVM to check whether
the memory type of a host pfn is UC/UC-.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/include/asm/pat.h |  2 ++
 arch/x86/mm/pat.c  | 14 ++
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
index fffb2794dd89..4027d9fb10a8 100644
--- a/arch/x86/include/asm/pat.h
+++ b/arch/x86/include/asm/pat.h
@@ -21,4 +21,6 @@ int io_reserve_memtype(resource_size_t start, resource_size_t end,
 
 void io_free_memtype(resource_size_t start, resource_size_t end);
 
+bool pat_pfn_is_uc(pfn_t pfn);
+
 #endif /* _ASM_X86_PAT_H */
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index fe7d57a8fb60..71fe9c29727e 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -677,6 +677,20 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
return rettype;
 }
 
+/**
+ * Check whether the memory type of a pfn is UC or UC-. The result is
+ * valid only when PAT is enabled.
+ *
+ * Returns true if it is; otherwise, returns false.
+ */
+bool pat_pfn_is_uc(pfn_t pfn)
+{
+   enum page_cache_mode cm = lookup_memtype(pfn_t_to_phys(pfn));
+
+   return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
+}
+EXPORT_SYMBOL_GPL(pat_pfn_is_uc);
+
 /**
  * io_reserve_memtype - Request a memory type mapping for a region of memory
  * @start: start (physical address) of the region
-- 
2.14.1



[PATCH v3 3/3] KVM: MMU: consider host cache mode in MMIO page check

2017-11-02 Thread Haozhong Zhang
Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang 
Reported-by: Cuevas Escareno, Ivan D 
Reported-by: Kumar, Karthik 
---
 arch/x86/kvm/mmu.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..058ff6442638 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2708,7 +2708,23 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
+   /*
+* Some reserved pages, such as those from
+* NVDIMM DAX devices, are not for MMIO, and
+* can be mapped with cached memory type for
+* better performance. However, the above
+* check misconceives those pages as MMIO.
+* Because KVM maps MMIO pages with UC memory
+* type, the performance of guest accesses to
+* those pages would be harmed. Therefore, we
+* check the host memory type in addition and
+* only treat UC/UC- pages as MMIO.
+*
+* pat_pfn_is_uc() works only when PAT is enabled,
+* so check pat_enabled() as well.
+*/
+   (!pat_enabled() || pat_pfn_is_uc(kvm_pfn_to_pfn(pfn)));
 
return true;
 }
-- 
2.14.1



Re: [PATCH v2 0/2] KVM: MMU: fix kvm_is_mmio_pfn()

2017-11-02 Thread Haozhong Zhang
On 11/02/17 13:37 -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 31, 2017 at 07:48:38PM +0800, Haozhong Zhang wrote:
> > [I just copy the commit message from patch 2]
> > 
> > Some reserved pages, such as those from NVDIMM DAX devices, are
> > not for MMIO, and can be mapped with cached memory type for better
> > performance. However, the above check misconceives those pages as
> > MMIO.  Because KVM maps MMIO pages with UC memory type, the
> > performance of guest accesses to those pages would be harmed.
> > Therefore, we check the host memory type by lookup_memtype() in
> > addition and only treat UC/UC- pages as MMIO.
> 
> Is there a specific workload you used to detect this?
>

Creating files on NVDIMM in a VM is much slower than the same operation
on bare metal. As the wrong EPT memory type (UC instead of WB) is used,
every guest access to NVDIMM would be slower than on bare metal.
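
The path involved, roughly (a sketch from my reading of the code, not a
literal copy of vmx.c): set_spte() passes kvm_is_mmio_pfn(pfn) down to
the vendor callback, and on VMX the EPT memory type is derived from it:

/* Simplified sketch of vmx_get_mt_mask() in arch/x86/kvm/vmx.c: */
u64 ept_memtype_sketch(bool is_mmio)
{
        if (is_mmio)    /* a misclassified NVDIMM page lands here: UC */
                return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
        /* otherwise normally WB, with IPAT set when no non-coherent
         * DMA device is assigned to the guest */
        return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
               VMX_EPT_IPAT_BIT;
}

So once kvm_is_mmio_pfn() stops misclassifying NVDIMM pages, the guest
gets WB mappings again.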

Haozhong

> Thanks!
> > 
> > Changes in v2:
> >  * Switch to lookup_memtype() to get host memory type.
> >  * Rewrite the comment in KVM MMU patch.
> >  * Remove v1 patch 2, which is not necessary in v2.
> > 
> > Haozhong Zhang (2):
> >   x86/mm: expose lookup_memtype()
> >   KVM: MMU: consider host cache mode in MMIO page check
> > 
> >  arch/x86/include/asm/pat.h |  2 ++
> >  arch/x86/kvm/mmu.c | 30 +++---
> >  arch/x86/mm/pat.c  |  3 ++-
> >  3 files changed, 31 insertions(+), 4 deletions(-)
> > 
> > -- 
> > 2.14.1
> > 


Re: [PATCH v2 2/2] KVM: MMU: consider host cache mode in MMIO page check

2017-11-02 Thread Haozhong Zhang
On 11/02/17 15:56 +0800, Xiao Guangrong wrote:
> 
> 
> On 10/31/2017 07:48 PM, Haozhong Zhang wrote:
> > Some reserved pages, such as those from NVDIMM DAX devices, are
> > not for MMIO, and can be mapped with cached memory type for better
> > performance. However, the above check misconceives those pages as
> > MMIO.  Because KVM maps MMIO pages with UC memory type, the
> > performance of guest accesses to those pages would be harmed.
> > Therefore, we check the host memory type by lookup_memtype() in
> > addition and only treat UC/UC- pages as MMIO.
> > 
> > Signed-off-by: Haozhong Zhang 
> > Reported-by: Cuevas Escareno, Ivan D 
> > Reported-by: Kumar, Karthik 
> > ---
> >   arch/x86/kvm/mmu.c | 30 +++---
> >   1 file changed, 27 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 0b481cc9c725..206828d18857 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -2707,10 +2707,34 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> >   static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> >   {
> > -   if (pfn_valid(pfn))
> > -   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> > +   bool is_mmio = true;
> > -   return true;
> > +   if (pfn_valid(pfn)) {
> > +   is_mmio = !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> > +
> > +   /*
> > +* Some reserved pages, such as those from NVDIMM DAX
> > +* devices, are not for MMIO, and can be mapped with
> > +* cached memory type for better performance. However,
> > +* the above check misconceives those pages as MMIO.
> > +* Because KVM maps MMIO pages with UC memory type,
> > +* the performance of guest accesses to those pages
> > +* would be harmed. Therefore, we check the host
> > +* memory type by lookup_memtype() in addition and
> > +* only treat UC/UC- pages as MMIO.
> > +*
> > +* lookup_memtype() works only when PAT is enabled, so
> > +* add pat_enabled() check here.
> > +*/
> > +   if (is_mmio && pat_enabled()) {
> > +   enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> > +
> > +   is_mmio = (cm == _PAGE_CACHE_MODE_UC ||
> > +  cm == _PAGE_CACHE_MODE_UC_MINUS);
> > +   }
> > +   }
> 
> You can move all of these detailed stuffs to pat.c and abstract them by 
> introducing
> a function, maybe named pat_pfn_is_uc(). I think this is what Ingo wants.
> 

OK, I'll move the cache mode check to a function in pat.c.

Let me wait for Paolo and others' comments. If no additional comments,
I'll send another version ASAP.

Thanks,
Haozhong


Re: [PATCH v2 0/2] KVM: MMU: fix kvm_is_mmio_pfn()

2017-11-01 Thread Haozhong Zhang
On 10/31/17 19:48 +0800, Haozhong Zhang wrote:
> [I just copy the commit message from patch 2]
> 
> Some reserved pages, such as those from NVDIMM DAX devices, are
> not for MMIO, and can be mapped with cached memory type for better
> performance. However, the above check misconceives those pages as
> MMIO.  Because KVM maps MMIO pages with UC memory type, the
> performance of guest accesses to those pages would be harmed.
> Therefore, we check the host memory type by lookup_memtype() in
> addition and only treat UC/UC- pages as MMIO.
> 
> Changes in v2:
>  * Switch to lookup_memtype() to get host memory type.
>  * Rewrite the comment in KVM MMU patch.
>  * Remove v1 patch 2, which is not necessary in v2.
> 
> Haozhong Zhang (2):
>   x86/mm: expose lookup_memtype()
>   KVM: MMU: consider host cache mode in MMIO page check
> 
>  arch/x86/include/asm/pat.h |  2 ++
>  arch/x86/kvm/mmu.c | 30 +++---
>  arch/x86/mm/pat.c  |  3 ++-
>  3 files changed, 31 insertions(+), 4 deletions(-)
> 
> -- 
> 2.14.1
> 

Hi Paolo,

This patchset fixes a performance drop issue when using NVDIMM on KVM,
so I think it would be nice to have in 4.14. Could you help review it?

Thanks,
Haozhong


Re: kvm: GPF in native_write_cr4

2017-10-31 Thread Haozhong Zhang
Hi Wanpeng,

On 10/31/17 19:10 +0800, Wanpeng Li wrote:
> 2017-10-31 17:59 GMT+08:00 Dmitry Vyukov :
> > Hello,
> >
> > I am seeing the following crash on upstream
> > 15f859ae5c43c7f0a064ed92d33f7a5bc5de6de0 (Oct 26).
> > Reproducer:
> > https://gist.githubusercontent.com/dvyukov/a9690f90c39c1e3b1b6c7acda2d5ef89/raw/33e07f3d6779005fc475764e0802e4a5aee8d0cf/gistfile1.txt
> > I run qemu with -append "kvm-intel.nested=1" -enable-kvm -cpu host. My
> > host cpu is E5-2690.
> >
> 
> I can't reproduce this w/ latest kvm/queue in both L0 and L1. In
> addition, there is a commit tries to fix cr4 recently.
> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=8eb3f87d903168bdbd1222776a6b1e281f50513e

The calltrace in this bug report is the same as the one I got before
the above commit.

In the previous bug, L0 KVM misused L2 CR4 as L1 CR4. When L1 KVM
tried to clear L1 CR4.VMXE in the L1 VM shutdown path, L0 KVM considered
that L1 intended to clear/set other bits as well (because the wrong L2
CR4 was used by L0 KVM as L1 CR4), but changes to those extra bits may
not be allowed against other L1 states.

In my previous fix, I tried to fix one place of such L1/L2 CR4
misuse. If there are no other places of CR4 misuse, you may have a look
at the guest states checked by kvm_set_cr4() against guest CR4
changes, and check whether the L1 and L2 versions of any of them are
misused.

It would make debugging easier if we could log which check fails in
kvm_set_cr4() when the calltrace appears (e.g., by adding a printk
before each "return 1" in kvm_set_cr4()).
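
For example (a hypothetical local debug hack, not meant for merging;
the two checks shown are from memory and may not match your tree
exactly):

/* In arch/x86/kvm/x86.c:kvm_set_cr4(), tag each rejection site: */
if (cr4 & CR4_RESERVED_BITS) {
        printk(KERN_DEBUG "kvm_set_cr4: reserved bits in %lx\n", cr4);
        return 1;
}

if (is_long_mode(vcpu) && !(cr4 & X86_CR4_PAE)) {
        printk(KERN_DEBUG "kvm_set_cr4: PAE cleared in long mode\n");
        return 1;
}

The failing check would then show up in dmesg right before the
calltrace.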


Haozhong



>  The testcase is complex; is the below strace log as you expected?
> 
> execve("./a.out", ["./a.out"], [/* 32 vars */]) = 0
> uname({sysname="Linux", nodename="kernel", ...}) = 0
> brk(NULL)   = 0x1d42000
> brk(0x1d431c0)  = 0x1d431c0
> arch_prctl(ARCH_SET_FS, 0x1d42880)  = 0
> readlink("/proc/self/exe", "/home/kernel/a.out", 4096) = 18
> brk(0x1d641c0)  = 0x1d641c0
> brk(0x1d65000)  = 0x1d65000
> access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or 
> directory)
> mmap(0x2000, 11481088, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2000
> openat(AT_FDCWD, "/dev/kvm", O_WRONLY)  = 3
> ioctl(3, KVM_CREATE_VM or LOGGER_GET_LOG_BUF_SIZE, 0) = 4
> ioctl(4, KVM_CREATE_VCPU, 0)= 5
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1230) = 0
> ioctl(4, KVM_SET_USER_MEMORY_REGION, 0x7fff5e6c1170) = 0
> ioctl(5, KVM_GET_SREGS, 0x7fff5e6c1330) = 0
> open("/dev/kvm", O_RDWR)= 6
> ioctl(6, KVM_GET_SUPPORTED_CPUID, 0x7fff5e6c1470) = 0
> ioctl(5, KVM_SET_CPUID2, 0x7fff5e6c1470) = 0
> close(6)= 0
> ioctl(5, KVM_SET_MSRS, 0x7fff5e6c0c30)  = 5
> ioctl(5, KVM_SET_SREGS, 0x7fff5e6c1330) = 0
> ioctl(5, KVM_SET_REGS, 0x7fff5e6c1230)  = 0
> mremap(0x20998000, 4096, 16384, MREMAP_MAYMOVE|MREMAP_FIXED,
> 0x200fa000) = 0x200fa000
> ioctl(5, KVM_RUN, 0)= 0
> mbind(0x2000, 8192, MPOL_DEFAULT 0x20001ff8, 2, MPOL_MF_MOVE) = 0
> exit_group(0)   = ?
> +++ exited with 0 +++
> 
> Regards,
> Wanpeng Li
> 
> > general protection fault:  [#1] SMP KASAN
> > Modules linked in:
> > CPU: 1 PID: 3064 Comm: a.out Not tainted 4.14.0-rc6+ #11
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > task: 880064c203c0 task.stack: 880066718000
> > RIP: 0010:native_write_cr4+0x4/0x10 arch/x86/include/asm/special_insns.h:75
> > RSP: 0018:88006671f598 EFLAGS: 00010097
> > RAX: 880064c203c0 RB

[PATCH v2 1/2] x86/mm: expose lookup_memtype()

2017-10-31 Thread Haozhong Zhang
KVM MMU will use it to get the cache mode of the host page.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/include/asm/pat.h | 2 ++
 arch/x86/mm/pat.c  | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
index fffb2794dd89..990d955972b8 100644
--- a/arch/x86/include/asm/pat.h
+++ b/arch/x86/include/asm/pat.h
@@ -21,4 +21,6 @@ int io_reserve_memtype(resource_size_t start, resource_size_t end,
 
 void io_free_memtype(resource_size_t start, resource_size_t end);
 
+enum page_cache_mode lookup_memtype(u64 paddr);
+
 #endif /* _ASM_X86_PAT_H */
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index fe7d57a8fb60..e85987aaf5d5 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -650,7 +650,7 @@ int free_memtype(u64 start, u64 end)
  * Returns _PAGE_CACHE_MODE_WB, _PAGE_CACHE_MODE_WC, _PAGE_CACHE_MODE_UC_MINUS
  * or _PAGE_CACHE_MODE_WT.
  */
-static enum page_cache_mode lookup_memtype(u64 paddr)
+enum page_cache_mode lookup_memtype(u64 paddr)
 {
enum page_cache_mode rettype = _PAGE_CACHE_MODE_WB;
struct memtype *entry;
@@ -676,6 +676,7 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
spin_unlock(&memtype_lock);
return rettype;
 }
+EXPORT_SYMBOL_GPL(lookup_memtype);
 
 /**
  * io_reserve_memtype - Request a memory type mapping for a region of memory
-- 
2.14.1



[PATCH v2 2/2] KVM: MMU: consider host cache mode in MMIO page check

2017-10-31 Thread Haozhong Zhang
Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang 
Reported-by: Cuevas Escareno, Ivan D 
Reported-by: Kumar, Karthik 
---
 arch/x86/kvm/mmu.c | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..206828d18857 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2707,10 +2707,34 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
-   if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   bool is_mmio = true;
 
-   return true;
+   if (pfn_valid(pfn)) {
+   is_mmio = !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+   /*
+* Some reserved pages, such as those from NVDIMM DAX
+* devices, are not for MMIO, and can be mapped with
+* cached memory type for better performance. However,
+* the above check misconceives those pages as MMIO.
+* Because KVM maps MMIO pages with UC memory type,
+* the performance of guest accesses to those pages
+* would be harmed. Therefore, we check the host
+* memory type by lookup_memtype() in addition and
+* only treat UC/UC- pages as MMIO.
+*
+* lookup_memtype() works only when PAT is enabled, so
+* add pat_enabled() check here.
+*/
+   if (is_mmio && pat_enabled()) {
+   enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
+
+   is_mmio = (cm == _PAGE_CACHE_MODE_UC ||
+  cm == _PAGE_CACHE_MODE_UC_MINUS);
+   }
+   }
+
+   return is_mmio;
 }
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
-- 
2.14.1



[PATCH v2 0/2] KVM: MMU: fix kvm_is_mmio_pfn()

2017-10-31 Thread Haozhong Zhang
[I just copy the commit message from patch 2]

Some reserved pages, such as those from NVDIMM DAX devices, are
not for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type by lookup_memtype() in
addition and only treat UC/UC- pages as MMIO.

Changes in v2:
 * Switch to lookup_memtype() to get host memory type.
 * Rewrite the comment in KVM MMU patch.
 * Remove v1 patch 2, which is not necessary in v2.

Haozhong Zhang (2):
  x86/mm: expose lookup_memtype()
  KVM: MMU: consider host cache mode in MMIO page check

 arch/x86/include/asm/pat.h |  2 ++
 arch/x86/kvm/mmu.c | 30 +++---
 arch/x86/mm/pat.c  |  3 ++-
 3 files changed, 31 insertions(+), 4 deletions(-)

-- 
2.14.1



Re: [PATCH 3/3] KVM: MMU: consider host cache type in MMIO pfn check

2017-10-31 Thread Haozhong Zhang
On 10/27/17 10:40 +0200, Ingo Molnar wrote:
> 
> * Haozhong Zhang  wrote:
> 
> > By default, KVM treats a reserved page as for MMIO purpose, and maps
> > it to guest with UC memory type.  However, some reserved pages are not
> > for MMIO, such as pages of DAX device (e.g., /dev/daxX.Y). Mapping
> > them with UC memory type will harm the performance.  In order to
> > exclude those cases, we check the host cache mode in addition and only
> > treat UC/UC- pages as MMIO.
> > 
> > Signed-off-by: Haozhong Zhang 
> > Reported-by: Cuevas Escareno, Ivan D 
> > Reported-by: Kumar, Karthik 
> > ---
> >  arch/x86/kvm/mmu.c | 32 +---
> >  1 file changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 0b481cc9c725..d4c821a6df3d 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -2707,10 +2707,36 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
> >  
> >  static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> >  {
> > -   if (pfn_valid(pfn))
> > -   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> > +   bool is_mmio = true;
> >  
> > -   return true;
> > +   if (pfn_valid(pfn)) {
> > +   is_mmio = !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> > +
> > +   /*
> > +* By default, KVM treats a reserved page as for MMIO
> > +* purpose, and maps it to guest with UC memory type.
> > +* However, some reserved pages are not for MMIO, such
> > +* as pages of DAX device (e.g., /dev/daxX.Y). Mapping
> > +* them with UC memory type will harm the performance.
> > +* In order to exclude those cases, we check the host
> > +* cache mode in addition and only treat UC/UC- pages
> > +* as MMIO.
> > +*
> > +* track_pfn_insert() works only when PAT is enabled,
> > +* so add pat_enabled() here.
> > +*/
> > +   if (is_mmio && pat_enabled()) {
> > +   pgprot_t prot;
> > +   enum page_cache_mode cm;
> > +
> > +   track_pfn_insert(NULL, &prot, kvm_pfn_to_pfn(pfn));
> > +   cm = pgprot2cachemode(prot);
> > +   is_mmio = (cm == _PAGE_CACHE_MODE_UC ||
> > +  cm == _PAGE_CACHE_MODE_UC_MINUS);
> > +   }
> > +   }
> > +
> > +   return is_mmio;
> >  }
> >  
> >  static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> 
> s/harm the performance
>  /harm performance/
> 
> But I suspect the rest of the comment should be rewritten too to be more 
> fluid.

I'll refactor the comment.

> 
> Beyond that - I think instead of exposing these low level details a properly 
> named 
> helper function should be put into pat.c instead - and KVM can use that.
>

KVM only needs the memory type information, so lookup_memtype() in
pat.c is probably the better function to expose.

Haozhong


[PATCH 2/3] KVM: add converters between pfn_t and kvm_pfn_t

2017-10-26 Thread Haozhong Zhang
Signed-off-by: Haozhong Zhang 
---
 include/linux/kvm_host.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6882538eda32..759fe498c89e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -67,6 +67,9 @@
 #define KVM_PFN_ERR_HWPOISON   (KVM_PFN_ERR_MASK + 1)
 #define KVM_PFN_ERR_RO_FAULT   (KVM_PFN_ERR_MASK + 2)
 
+#define kvm_pfn_to_pfn(x) ((pfn_t){ .val = (x)})
+#define pfn_to_kvm_pfn(x) ((kvm_pfn_t)((x).val))
+
 /*
  * error pfns indicate that the gfn is in slot but faild to
  * translate it to pfn on host.
-- 
2.14.1



[PATCH 1/3] x86/mm: expose track_pfn_insert()

2017-10-26 Thread Haozhong Zhang
KVM MMU will use it to get the cache mode of the host pfn.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/mm/pat.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index fe7d57a8fb60..cab593ea8956 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -998,6 +998,7 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
 cachemode2protval(pcm));
 }
+EXPORT_SYMBOL_GPL(track_pfn_insert);
 
 /*
  * untrack_pfn is called while unmapping a pfnmap for a region.
-- 
2.14.1



[PATCH 0/3] KVM: MMU: fix kvm_is_mmio_pfn()

2017-10-26 Thread Haozhong Zhang
[I just copy the commit message from patch 3]

By default, KVM treats a reserved page as for MMIO purpose, and maps
it to guest with UC memory type.  However, some reserved pages are not
for MMIO, such as pages of DAX device (e.g., /dev/daxX.Y). Mapping
them with UC memory type will harm the performance.  In order to
exclude those cases, we check the host cache mode in addition and only
treat UC/UC- pages as MMIO.

Haozhong Zhang (3):
  x86/mm: expose track_pfn_insert()
  KVM: add converters between pfn_t and kvm_pfn_t
  KVM: MMU: consider host cache type in MMIO pfn check

 arch/x86/kvm/mmu.c   | 32 +---
 arch/x86/mm/pat.c|  1 +
 include/linux/kvm_host.h |  3 +++
 3 files changed, 33 insertions(+), 3 deletions(-)

-- 
2.14.1



[PATCH 3/3] KVM: MMU: consider host cache type in MMIO pfn check

2017-10-26 Thread Haozhong Zhang
By default, KVM treats a reserved page as for MMIO purpose, and maps
it to guest with UC memory type.  However, some reserved pages are not
for MMIO, such as pages of DAX device (e.g., /dev/daxX.Y). Mapping
them with UC memory type will harm the performance.  In order to
exclude those cases, we check the host cache mode in addition and only
treat UC/UC- pages as MMIO.

Signed-off-by: Haozhong Zhang 
Reported-by: Cuevas Escareno, Ivan D 
Reported-by: Kumar, Karthik 
---
 arch/x86/kvm/mmu.c | 32 +---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b481cc9c725..d4c821a6df3d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2707,10 +2707,36 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
-   if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+   bool is_mmio = true;
 
-   return true;
+   if (pfn_valid(pfn)) {
+   is_mmio = !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+   /*
+* By default, KVM treats a reserved page as for MMIO
+* purpose, and maps it to guest with UC memory type.
+* However, some reserved pages are not for MMIO, such
+* as pages of DAX device (e.g., /dev/daxX.Y). Mapping
+* them with UC memory type will harm the performance.
+* In order to exclude those cases, we check the host
+* cache mode in addition and only treat UC/UC- pages
+* as MMIO.
+*
+* track_pfn_insert() works only when PAT is enabled,
+* so add pat_enabled() here.
+*/
+   if (is_mmio && pat_enabled()) {
+   pgprot_t prot;
+   enum page_cache_mode cm;
+
+   track_pfn_insert(NULL, &prot, kvm_pfn_to_pfn(pfn));
+   cm = pgprot2cachemode(prot);
+   is_mmio = (cm == _PAGE_CACHE_MODE_UC ||
+  cm == _PAGE_CACHE_MODE_UC_MINUS);
+   }
+   }
+
+   return is_mmio;
 }
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
-- 
2.14.1



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-20 Thread Haozhong Zhang

On 10/14/16 13:18 +0100, Andrew Cooper wrote:

On 14/10/16 08:08, Haozhong Zhang wrote:

On 10/13/16 20:33 +0100, Andrew Cooper wrote:

On 13/10/16 19:59, Dan Williams wrote:

On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper
 wrote:

On 13/10/16 16:40, Dan Williams wrote:

On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich 
wrote:
[..]

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.

Well, my opinion certainly doesn't count much here, but I
continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel. Or if we were to consider Dom0
"just a guest", things should even be the other way around: Xen gets
all of the OS reserved space, and Dom0 needs something custom.

You haven't made the case why Xen is special and other
applications of
persistent memory are not.

In a Xen system, Xen runs in the baremetal root-mode ring0, and
dom0 is
a VM running in ring1/3 with the nvdimm driver.  This is the opposite
way around to the KVM model.

Dom0, being the hardware domain, has default ownership of all the
hardware, but to gain access in the first place, it must request a
mapping from Xen.

This is where my understanding the Xen model breaks down.  Are you
saying dom0 can't access the persistent memory range unless the ring0
agent has metadata storage space for tracking what it maps into dom0?


No.  I am trying to point out that the current suggestion won't work, and
needs re-designing.

Xen *must* be able to properly configure mappings of the NVDIMM for
dom0, *without* modifying any content on the NVDIMM.  Otherwise, data
corruption will occur.

Whether this means no Xen metadata, or the metadata living elsewhere in
regular ram, such as the main frametable, is an implementation detail.




Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to
work
and figure out what is on the DIMM, and which areas are safe to use.

I don't understand this ordering of events.  Dom0 needs to have a
mapping to even write the on-media structure to indicate a
reservation.  So, initial dom0 access can't depend on metadata
reservation already being present.


I agree.

Overall, I think the following is needed.

* Xen starts up.
** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
needs to note this information somehow.
** Xen might find some Type 7 E820 regions, and needs to note this
information somehow.


IIUC, this is to collect MFNs, and there is no need to create the frame
table and M2P at this stage. If so, how is this different from ...


* Xen starts dom0.
* Once OSPM is running, a Xen component in Linux needs to collect and
report all NVDIMM SPA/MFN regions it knowns about.
** This covers the AML-only case, and the hotplug case.


... the MFNs reported here, especially given that the former is a subset
of the latter (hotplugged ones are not included in the former).


Hopefully nothing.  However, Xen shouldn't exclusively rely on the dom0
when it is capable of working things out itself (which can aid with
debugging one half of this arrangement).  Also, the MFNs found by Xen
alone can be present in the default memory map for dom0.



Sure, I'll add code to parse the NFIT in Xen to discover statically
plugged pmem-mode NVDIMMs and their MFNs.

By the default memory map for dom0, do you mean making
XENMEM_memory_map return the above MFNs in the Dom0 E820?



(There are no E820 holes or SRAT entries to tell which address range is
reserved for hotplugged NVDIMMs.)


* Dom0 requests a mapping of the NVDIMMs via the usual mechanism.


Two questions:
1. Why is this request necessary? Even without such requests, as in
  my current implementation, Dom0 can still access NVDIMM.


Can it?  (if so, great, but I don't think this holds in the general
case.)  Is that a side effect of the NVDIMM being covered by a hole in
the E820?


In my development environment, NVDIMM MFNs are not covered by any E820
entry and appear after RAM MFNs.

Can you explain more about this point? Why can it work if it is
covered by an E820 hole?


The current logic for what dom0 may access by default is
somewhat ad-hoc, and I have a gut feeling that it won't work with E820
type 7 regions.



  Or do you mean the Xen hypervisor should by default disallow Dom0
  from accessing the MFNs reported in the previous step until they are
  requested?


No - I am not suggesting this.



2. Who initiates the requests? If it's the libnvdimm driver, that
  means we still need to introduce Xen specific code to the driver.

  Or the requests are issued by OSPM (or the Xen component you
  mentioned above) when they probe new dimms?

  For the latter, Dan, do you think it's acceptable in NFIT code to
  call the Xen component to request the access permission of the pmem
  regions, e.g. in ap

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-20 Thread Haozhong Zhang

On 10/14/16 04:16 -0600, Jan Beulich wrote:

On 13.10.16 at 17:46,  wrote:

On 10/13/16 03:08 -0600, Jan Beulich wrote:

On 13.10.16 at 10:53,  wrote:

On 10/13/16 02:34 -0600, Jan Beulich wrote:

On 12.10.16 at 18:19,  wrote:

On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:

On 12.10.16 at 17:42,  wrote:

On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:

On 12.10.16 at 16:58,  wrote:

On 10/12/16 05:32 -0600, Jan Beulich wrote:

On 12.10.16 at 12:33,  wrote:

The layout is shown as the following diagram.

+---------------+-----------+-------+----------+--------------+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel    |   Table   | Block | for Xen  |              |
+---------------+-----------+-------+----------+--------------+
\______________________________ _______________________________/
                               V
                          /dev/pmem0


I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.



Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is?  Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?


The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.



No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume.  Explain again what is wrong with the partition approach?


Not sure what was unclear in my previous reply. I don't think there
should be apriori knowledge of whether Xen is (going to be) used on
a system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.


The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use.  So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.


Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.



I think Dan means that there should be a clear separation between
reservations for different usages (kernel/xen/...). The libnvdimm
driver is for the linux kernel and only needs to maintain the
reservation for kernel functionality. For others including xen/dm/...,
if they want reservation for their own purpose, they should maintain
their own reservations out of libnvdimm driver and avoid bothering the
libnvdimm driver (e.g. add specific handling in libnvdimm driver).

IIUC, one existing example is device-mapper device (dm) which needs to
reserve on-device area for its own meta-data. Its choice is to store
the meta-data on the block device (/dev/pmemN) provided by the
libnvdimm driver.

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.


Well, my opinion certainly doesn't count much here, but I continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel.


No such independent concept seems to exist right now. It may be hard to
define such a concept, because it's hard to know the common requirements
(e.g. size/alignment/...) from ALL OSes. Making each component
maintain its own reservation in its own way seems more flexible.


Or if we were to consider Dom0
"just a guest", things should even be the other way around: Xen gets
all of the OS reserved space, and Dom0 needs something custom.



Sure, it's possible to implement the driver in a way that i

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-14 Thread Haozhong Zhang

On 10/13/16 20:33 +0100, Andrew Cooper wrote:

On 13/10/16 19:59, Dan Williams wrote:

On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper
 wrote:

On 13/10/16 16:40, Dan Williams wrote:

On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich  wrote:
[..]

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.

Well, my opinion certainly doesn't count much here, but I continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel. Or if we were to consider Dom0
"just a guest", things should even be the other way around: Xen gets
all of the OS reserved space, and Dom0 needs something custom.

You haven't made the case why Xen is special and other applications of
persistent memory are not.

In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
a VM running in ring1/3 with the nvdimm driver.  This is the opposite
way around to the KVM model.

Dom0, being the hardware domain, has default ownership of all the
hardware, but to gain access in the first place, it must request a
mapping from Xen.

This is where my understanding the Xen model breaks down.  Are you
saying dom0 can't access the persistent memory range unless the ring0
agent has metadata storage space for tracking what it maps into dom0?


No.  I am trying to point out that the current suggestion won't work, and
needs re-designing.

Xen *must* be able to properly configure mappings of the NVDIMM for
dom0, *without* modifying any content on the NVDIMM.  Otherwise, data
corruption will occur.

Whether this means no Xen metadata, or the metadata living elsewhere in
regular ram, such as the main frametable, is an implementation detail.




Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
and figure out what is on the DIMM, and which areas are safe to use.

I don't understand this ordering of events.  Dom0 needs to have a
mapping to even write the on-media structure to indicate a
reservation.  So, initial dom0 access can't depend on metadata
reservation already being present.


I agree.

Overall, I think the following is needed.

* Xen starts up.
** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
needs to note this information somehow.
** Xen might find some Type 7 E820 regions, and needs to note this
information somehow.


IIUC, this is to collect MFNs, and there is no need to create the frame
table and M2P at this stage. If so, how is this different from ...


* Xen starts dom0.
* Once OSPM is running, a Xen component in Linux needs to collect and
report all NVDIMM SPA/MFN regions it knowns about.
** This covers the AML-only case, and the hotplug case.


... the MFNs reported here, especially given that the former is a subset
of the latter (hotplugged ones are not included in the former).

(There are no E820 holes or SRAT entries to tell which address range is
reserved for hotplugged NVDIMMs.)


* Dom0 requests a mapping of the NVDIMMs via the usual mechanism.


Two questions:
1. Why is this request necessary? Even without such requests, as in
  my current implementation, Dom0 can still access NVDIMM.

  Or do you mean the Xen hypervisor should by default disallow Dom0
  from accessing the MFNs reported in the previous step until they are
  requested?

2. Who initiates the requests? If it's the libnvdimm driver, that
  means we still need to introduce Xen specific code to the driver.

  Or the requests are issued by OSPM (or the Xen component you
  mentioned above) when they probe new dimms?

  For the latter, Dan, do you think it's acceptable in NFIT code to
  call the Xen component to request the access permission of the pmem
  regions, e.g. in acpi_nfit_insert_resource()? Of course, it's only
  used in the Dom0 case.


** This should work, as Xen is aware that there is something there to be
mapped (rather than just empty physical address space).
* Dom0 finds that some NVDIMM ranges are now available for use (probably
modelled as hotplug events).
* /dev/pmem $STUFF starts happening as normal.

At some pointer later after dom0 policy decisions are made (ultimately,
by the host administrator):
* If an area of NVDIMM is chosen for Xen to use, Dom0 needs to inform
Xen of the SPA/MFN regions which are safe to use.
* Xen then incorporates these regions into its idea of RAM, and starts
using them for whatever.



Agreed. I think we may not need to fix the way/format/... of making the
reservation, and can instead let the users (host administrators), who
have a better understanding of their data, make the proper decision.

In the worst case, where no reservation is made, the Xen hypervisor
could fall back to using RAM for the NVDIMM management structures, at
the cost of less RAM for guests.

Thanks,
Haozhong


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Haozhong Zhang

On 10/13/16 03:08 -0600, Jan Beulich wrote:

On 13.10.16 at 10:53,  wrote:

On 10/13/16 02:34 -0600, Jan Beulich wrote:

On 12.10.16 at 18:19,  wrote:

On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:

On 12.10.16 at 17:42,  wrote:

On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:

On 12.10.16 at 16:58,  wrote:

On 10/12/16 05:32 -0600, Jan Beulich wrote:

On 12.10.16 at 12:33,  wrote:

The layout is shown as the following diagram.

+---------------+-----------+-------+----------+--------------+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel    |   Table   | Block | for Xen  |              |
+---------------+-----------+-------+----------+--------------+
\______________________________ _______________________________/
                               V
                          /dev/pmem0


I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.



Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is?  Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?


The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.



No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume.  Explain again what is wrong with the partition approach?


Not sure what was unclear in my previous reply. I don't think there
should be apriori knowledge of whether Xen is (going to be) used on
a system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.


The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use.  So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.


Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.



I think Dan means that there should be a clear separation between
reservations for different usages (kernel/xen/...). The libnvdimm
driver is for the linux kernel and only needs to maintain the
reservation for kernel functionality. For others including xen/dm/...,
if they want reservation for their own purpose, they should maintain
their own reservations out of libnvdimm driver and avoid bothering the
libnvdimm driver (e.g. add specific handling in libnvdimm driver).

IIUC, one existing example is device-mapper device (dm) which needs to
reserve on-device area for its own meta-data. Its choice is to store
the meta-data on the block device (/dev/pmemN) provided by the
libnvdimm driver.

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.


Well, my opinion certainly doesn't count much here, but I continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel.


No such independent concept seems to exist right now. It may be hard to
define such a concept, because it's hard to know the common requirements
(e.g. size/alignment/...) from ALL OSes. Making each component
maintain its own reservation in its own way seems more flexible.


Or if we were to consider Dom0
"just a guest", things should even be the other way around: Xen gets
all of the OS reserved space, and Dom0 needs something custom.



Sure, it's possible to implement the driver in a way that if the
driver finds it runs on Xen, then it just leaves the OS reserved area

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Haozhong Zhang

+Dan Williams

I accidentally dropped him in my last reply. Add him back.

On 10/13/16 16:53 +0800, Haozhong Zhang wrote:

On 10/13/16 02:34 -0600, Jan Beulich wrote:

On 12.10.16 at 18:19,  wrote:

On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:

On 12.10.16 at 17:42,  wrote:

On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:

On 12.10.16 at 16:58,  wrote:

On 10/12/16 05:32 -0600, Jan Beulich wrote:

On 12.10.16 at 12:33,  wrote:

The layout is shown as the following diagram.

+---------------+-----------+-------+----------+--------------+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel    |   Table   | Block | for Xen  |              |
+---------------+-----------+-------+----------+--------------+
\______________________________ _______________________________/
                               V
                          /dev/pmem0


I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.



Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is?  Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?


The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.



No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume.  Explain again what is wrong with the partition approach?


Not sure what was unclear in my previous reply. I don't think there
should be apriori knowledge of whether Xen is (going to be) used on
a system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.


The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use.  So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.


Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.



I think Dan means that there should be a clear separation between
reservations for different usages (kernel/Xen/...). The libnvdimm
driver is for the Linux kernel and only needs to maintain the
reservation for kernel functionality. Others, including Xen/dm/...,
that want a reservation for their own purposes should maintain their
own reservations outside the libnvdimm driver, rather than burdening
it with, e.g., Xen-specific handling.

IIUC, one existing example is a device-mapper (dm) device, which needs
to reserve an on-device area for its own metadata. Its choice is to
store the metadata on the block device (/dev/pmemN) provided by the
libnvdimm driver.

I think we can do something similar for Xen, i.e. layer another pseudo
device on /dev/pmem and make the reservation there, as in option 2 of
my previous reply.

Thanks,
Haozhong


The kernel needs to know about the struct page reservation because it
needs to manage the lifetime of page references vs the lifetime of the
device.  It does not have the same relationship with a Xen reservation
which is why I'm proposing they be managed separately.


I don't think I understand the difference you are trying to point out
here. Linux's struct page and Xen's struct page_info serve the same
fundamental purpose.

Jan


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Haozhong Zhang

On 10/12/16 05:32 -0600, Jan Beulich wrote:

On 12.10.16 at 12:33,  wrote:

The layout is shown as the following diagram.

+---------------+-----------+-------+----------+--------------+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|   by kernel   |   Table   | Block | for Xen  |              |
+---------------+-----------+-------+----------+--------------+
 \___________________________  ____________________________/
                              V
                          /dev/pmem0


I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.



Isn't this OS-reserved area still not OS-agnostic, as it requires the
OS to know where the reserved area is?  Or do you mean it becomes
OS-agnostic if it's defined by a protocol that is accepted by all OSes?

Let me list two other methods that come to mind.

1. The first method extends the usage of the super block used by
  current Linux kernel to reserve space on pmem.

  Current Linux kernel places a super block of the following
  structure near the beginning of a pmem namespace.

   struct nd_pfn_sb {
           u8 signature[PFN_SIG_LEN];
           u8 uuid[16];
           u8 parent_uuid[16];
           __le32 flags;
           __le16 version_major;
           __le16 version_minor;
           __le64 dataoff; /* relative to namespace_base + start_pad */
           __le64 npfns;
           __le32 mode;
           /* minor-version-1 additions for section alignment */
           __le32 start_pad;
           __le32 end_trunc;
           /* minor-version-2 record the base alignment of the mapping */
           __le32 align;
           u8 padding[4000];
           __le64 checksum;
   };

   Two interesting fields here are 'dataoff' and 'mode':
   - 'dataoff' indicates the offset where the data area starts,
     i.e., IIUC, the part that can be accessed via /dev/pmemN or
     /dev/daxN.
   - 'mode' indicates whether Linux puts struct page for this
     namespace in RAM (= PFN_MODE_RAM) or on the device (=
     PFN_MODE_PMEM).

   Currently for Linux, only 'mode' is customizable, while 'dataoff'
   is not. If mode == PFN_MODE_RAM, no reservation for struct page is
   made on the device, and dataoff starts almost immediately after
   the super block, except for a small reserved area in between for
   other structures and alignment. If mode == PFN_MODE_PMEM, the size
   of the reservation is decided by the kernel, i.e. 64 bytes per
   struct page.
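
   As a purely illustrative aid, here is a minimal userspace sketch
   that reads the leading fields of the structure above and prints
   'dataoff', 'npfns' and 'mode'. It is not a real tool: SB_OFFSET is
   an assumption (the actual info-block location depends on the
   namespace layout), and a little-endian host is assumed so the
   on-media __le fields are read without conversion. The mode values
   match the enum nd_pfn_mode in the patch later in this archive.

   #include <stdint.h>
   #include <stdio.h>

   #define SB_OFFSET 4096          /* assumed info-block offset */
   #define PFN_SIG_LEN 16

   struct nd_pfn_sb_head {         /* leading fields of nd_pfn_sb */
           uint8_t  signature[PFN_SIG_LEN];
           uint8_t  uuid[16];
           uint8_t  parent_uuid[16];
           uint32_t flags;
           uint16_t version_major;
           uint16_t version_minor;
           uint64_t dataoff;       /* relative to namespace_base + start_pad */
           uint64_t npfns;
           uint32_t mode;          /* 0 = none, 1 = ram, 2 = pmem */
   };

   int main(int argc, char **argv)
   {
           const char *dev = argc > 1 ? argv[1] : "/dev/pmem0";
           struct nd_pfn_sb_head sb;
           FILE *f = fopen(dev, "rb");

           if (!f || fseek(f, SB_OFFSET, SEEK_SET) != 0 ||
               fread(&sb, sizeof(sb), 1, f) != 1) {
                   perror(dev);
                   return 1;
           }
           printf("dataoff: %llu, npfns: %llu, mode: %u\n",
                  (unsigned long long)sb.dataoff,
                  (unsigned long long)sb.npfns, sb.mode);
           fclose(f);
           return 0;
   }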

   I propose to make the size of the reserved area customizable,
   e.g. via ioctl and ndctl.
   - If mode == PFN_MODE_PMEM and
     * if the given reserved size is large enough to hold what an OS
       (not limited to Linux) wants to put in, then the OS just uses
       it as desired;
     * if the given reserved size is not enough, then the OS reports
       an error and may take other fallback actions.
   - If mode == PFN_MODE_RAM and
     * if the reserved size is zero, then it's the current way that
       Linux uses the device;
     * if the reserved size is non-zero, I would like to reserve this
       case for hypervisor (right now, namely Xen hypervisor)
       usage. That is, the OS should not use the reserved area. For
       Xen, we could add a function in the xen driver in the kernel
       to report the reserved area to the hypervisor.

   I guess this might be the OS-agnostic way Jan expects, but Dan may
   object to it.


2. Lay another pseudo device on the block device (e.g. /dev/pmemN)
  provided by the NVDIMM driver.

  This pseudo device can reserve space according to the user's
  requirement. The reservation information can be persistently
  recorded in a super block before the reserved area.

  This pseudo device also implements another pseudo block device to
  allow the non-reserved area to be accessed as a block device (we
  can even implement it as DAX-capable).

                                              pseudo block device
                                             /---------^---------\
 +------------------+-------+---------------+---------------------+
 |  whatever used   | Super |  reserved by  |                     |
 | by NVDIMM driver | Block | pseudo device |                     |
 +------------------+-------+---------------+---------------------+
 \______________________________  _______________________________/
                                 V
                             /dev/pmem0
  

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Haozhong Zhang

On 10/11/16 13:17 -0700, Dan Williams wrote:

On Tue, Oct 11, 2016 at 12:48 PM, Konrad Rzeszutek Wilk
 wrote:

On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:

On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk
 wrote:
> On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
[..]
>> Right, but why does the libnvdimm core need to know about this
>> specific Xen reservation?  For example, if Xen wants some in-kernel
>
> Let me turn this around - why does the libnvdimm core need to know about
> Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
> for example can also poke a hole in this and fill it with its
> OS-management meta-data?

Specifically the core needs to know so that it can answer the Linux
specific question of whether the pfn returned by ->direct_access() has
a corresponding struct page or not. It's tied to the lifetime of the
device and the usage of the reservation needs to be coordinated
against the references of those pages.  If FreeBSD decides it needs to
reserve "struct page" capacity at the start of the device, I would
hope that it reuses the same on-device info block that Linux is using
and not create a new "FreeBSD-mode" device type.


The issue here (as I understand, I may be missing something new)
is that the size of this special namespace may be different. That is
the 'struct page' on FreeBSD could be 256 bytes while on Linux it is
64 bytes (numbers pulled out of the sky).

Hence one would have to expand it (or some such) to re-use this.


Sure, but we could support that today.  If FreeBSD lays down the info
block it is free to make a bigger reservation and Linux would be happy
to use a smaller subset.  If we, as an industry, want this "struct
page" reservation to be common we can take it to a standards body to
make as a cross-OS guarantee... but I think this is separate from the
Xen reservation.


To be honest I do not yet understand what metadata Xen wants to store
in the device, but it seems the producer and consumer of that metadata
is Xen itself and not the wider Linux kernel as is the case with
struct page.  Can you fill me in on what problem Xen solves with this


Exactly!

reservation?


The same as Linux - its variant of 'struct page'. Which I think is
smaller than the Linux one, but perhaps it is not?



If the hypervisor needs to know where it can store some metadata, can
that be satisfied with userspace tooling in Dom0? Something like,
"/dev/pmem0p1 == Xen metadata" and "/dev/pmem0p2 == DAX filesystem
with files to hand to guests".  So my question is not about the
rationale for having metadata, it's why does the Linux kernel need to
know about the Xen reservation? As far as I can see it is independent
/ opaque to the kernel.


Thanks to everyone for all these comments!

How about doing the reservation in the following way:

1. Create partition(s) on /dev/pmemX and make sure the space between
  the partition table (plus any potential padding) and the first
  partition is large enough to hold Xen's management structures and
  the super block introduced in step 2. That space, excluding the
  super block itself, will be used as the reserved area.

2. Write a super block before above reserved area. The super block
  records the base address and the size of the reserved area. It also
  contains a signature and a checksum to identify itself.

The layout is shown as the following diagram.

+---------------+-----------+-------+----------+--------------+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|   by kernel   |   Table   | Block | for Xen  |              |
+---------------+-----------+-------+----------+--------------+
 \___________________________  ____________________________/
                              V
                          /dev/pmem0

Above two steps can be done via a userspace program and do not need
the Xen hypervisor to be running. The partitions on the device can be
used regardless of whether the Xen hypervisor is present. A sketch of
one possible super-block layout follows.
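
As a concrete but purely illustrative sketch, the step-2 super block
could be as small as the following; every field name, the signature
string and the checksum choice are placeholders, not a settled
on-media format:

struct xen_rsv_sb {
        uint8_t  signature[16];   /* e.g. "XEN_PMEM_RSV" + NUL padding */
        uint64_t rsv_base;        /* byte offset of the reserved area */
        uint64_t rsv_size;        /* size of the reserved area in bytes */
        uint64_t checksum;        /* e.g. Fletcher-64 over this block */
};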

3. When Xen is running, implement a function in the Dom0 Linux xen
  driver (drivers/xen/) to respond to udevd events that notify it of
  newly detected pmem regions.

  This function searches the pmem region for the super block created
  in step 2. If one is found, it knows this pmem region has been
  prepared for Xen usage.

  Then it gets the base address and size of the reserved area (from
  the super block) and the entire address range of the pmem region
  (from the pmem driver), and reports them to the Xen hypervisor; a
  sketch of this detection follows below.

The implementation of this step can be completely contained in the
kernel Xen driver. (It may also be implemented as a udevd service in
userspace, if that is not considered unsafe.)
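
A hedged sketch of what the step-3 detection could look like; all
names besides xen_pmem_add() (whose actual prototype appears in the
RFC patches later in this archive) are placeholders, and data_off /
data_size would come from parsing the partition table:

/* Illustrative only, not actual driver code. */
static int xen_pmem_maybe_report(struct xen_rsv_sb *sb,
                                 uint64_t spa, size_t size,
                                 uint64_t data_off, size_t data_size)
{
        /* look for the placeholder signature sketched in step 2 */
        if (memcmp(sb->signature, "XEN_PMEM_RSV", 12) != 0)
                return 0;       /* region was not prepared for Xen */

        /* rsv_base/rsv_size as recorded by the step-2 userspace tool */
        return xen_pmem_add(spa, size, sb->rsv_base, sb->rsv_size,
                            data_off, data_size);
}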

Thanks,
Haozhong


Re: [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Haozhong Zhang
On 10/10/16 09:24, Dan Williams wrote:
> On Sun, Oct 9, 2016 at 11:32 PM, Haozhong Zhang
>  wrote:
> > On 10/09/16 20:45, Dan Williams wrote:
> >> On Sun, Oct 9, 2016 at 5:35 PM, Haozhong Zhang  
> >> wrote:
> >> > Overview
> >> > 
> >> > This RFC kernel patch series along with corresponding patch series of
> >> > Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> >> > NVDIMM devices to Xen HVM domU as vNVDIMM devices.
> >> >
> >> > Xen hypervisor does not include an NVDIMM driver, so it needs the
> >> > assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> >> > devices. We currently only support NVDIMM devices in pmem mode.
> >> >
> >> > Design and Implementation
> >> > =
> >> > The complete design can be found at
> >> >   
> >> > https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
> >>
> >> The KVM enabling for persistent memory does not need this support from
> >> the kernel, and as far as I can see neither does Xen. If the
> >> hypervisor needs to reserve some space it can simply trim the amount
> >> that it hands to the guest.
> >>
> >
> > Xen does not have the NVDIMM driver, so it cannot operate on NVDIMM
> > devices by itself. Instead it relies on the driver in Dom0 Linux to
> > probe NVDIMM and make the reservation.
> 
> I'm missing something because the design document talks about mmap'ing
> files on a DAX filesystem.  So, I'm assuming it is similar to the KVM
> NVDIMM virtualization case where an mmap range in dom0 is translated
> into a guest physical range.  The suggestion is to reserve some memory
> out of that mapping rather than introduce a new info block /
> reservation type to the sub-system.

Just like struct page in Linux, the Xen hypervisor uses a struct
page_info for its memory management. We face the same problem as the
Linux kernel: where to store those structs for pmem. We decided to put
them in a reserved area on pmem, similar to what the pfn device in the
kernel does.

Reserving at mmap time, out of what is mapped, does not work. It's a
bootstrap problem: Xen needs the information about those pages, which
is stored in struct page_info, at the moment of mapping. That is, the
page_info structs for pmem pages must be prepared before those pages
are actually used.

However, per the ongoing discussion with Andrew Cooper in another
thread, if the Xen hypervisor turns to treating pmem pages as MMIO,
then the reservation may not be needed. Let's see what conclusion is
reached there.

Thanks,
Haozhong


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-10 Thread Haozhong Zhang
On 10/10/16 17:43, Andrew Cooper wrote:
> On 10/10/16 01:35, Haozhong Zhang wrote:
> > Overview
> > 
> > This RFC kernel patch series along with corresponding patch series of
> > Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> > NVDIMM devices to Xen HVM domU as vNVDIMM devices.
> >
> > Xen hypervisor does not include an NVDIMM driver, so it needs the
> > assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> > devices. We currently only support NVDIMM devices in pmem mode.
> >
> > Design and Implementation
> > =
> > The complete design can be found at
> >   
> > https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
> >
> > All patch series can be found at
> >   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
> >   QEMU: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
> >   Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
> >   ndctl:https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
> >
> > Xen hypervisor needs assistance from the Dom0 Linux kernel for the following tasks:
> > 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >memory management data structures, i.e. frame table and M2P table.
> > 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >hypervisor.
> 
> Please can we take a step back here before diving down a rabbit hole.
> 
> 
> How do pblk/pmem regions appear in the E820 map at boot?  At the very
> least, I would expect at least a large reserved region.

The ACPI specification does not require them to appear in E820, though
it defines E820 type 7 for persistent memory.

> 
> Is the MFN information (SPA in your terminology, so far as I can tell)
> available in any static APCI tables, or are they only available as a
> result of executing AML methods?
>

For NVDIMM devices already plugged in at power-on, their MFN
information can be obtained from the static NFIT table. However, MFN
information for hotplugged NVDIMM devices must be obtained via the AML
_FIT method, so point 2) is needed.
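
For reference, a sketch (not part of the RFC) of walking the static
NFIT for SPA ranges, using the kernel's ACPICA types from
<acpi/actbl1.h>; hotplugged DIMMs would still require the _FIT method:

static void dump_nfit_spa_ranges(struct acpi_table_nfit *nfit)
{
        void *p = nfit + 1;     /* first NFIT sub-table */
        void *end = (void *)nfit + nfit->header.length;

        while (p + sizeof(struct acpi_nfit_header) <= end) {
                struct acpi_nfit_header *hdr = p;

                if (hdr->type == ACPI_NFIT_TYPE_SYSTEM_ADDRESS) {
                        struct acpi_nfit_system_address *spa = p;

                        pr_info("NFIT SPA range: 0x%llx + 0x%llx\n",
                                spa->address, spa->length);
                }
                p += hdr->length;
        }
}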

> 
> If the MFN information is only available via AML, then point 2) is
> needed, although the reporting back to Xen should be restricted to a xen
> component, rather than polluting the main device driver.
> 
> However, I can't see any justification for 1).  Dom0 should not be
> involved in Xen's management of its own frame table and m2p.  The mfns
> making up the pmem/pblk regions should be treated just like any other
> MMIO regions, and be handed wholesale to dom0 by default.
>

Do you mean to treat them as MMIO pages of type p2m_mmio_direct and
map them to the guest via map_mmio_regions()?

Thanks,
Haozhong


Re: [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-09 Thread Haozhong Zhang
On 10/09/16 20:45, Dan Williams wrote:
> On Sun, Oct 9, 2016 at 5:35 PM, Haozhong Zhang  
> wrote:
> > Overview
> > 
> > This RFC kernel patch series along with corresponding patch series of
> > Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> > NVDIMM devices to Xen HVM domU as vNVDIMM devices.
> >
> > Xen hypervisor does not include an NVDIMM driver, so it needs the
> > assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> > devices. We currently only support NVDIMM devices in pmem mode.
> >
> > Design and Implementation
> > =
> > The complete design can be found at
> >   
> > https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
> 
> The KVM enabling for persistent memory does not need this support from
> the kernel, and as far as I can see neither does Xen. If the
> hypervisor needs to reserve some space it can simply trim the amount
> that it hands to the guest.
>

Xen does not have the NVDIMM driver, so it cannot operate on NVDIMM
devices by itself. Instead it relies on the driver in Dom0 Linux to
probe NVDIMM and make the reservation.

> The usage of fiemap and the sysfs resource for the pmem device, as
> mentioned in the design document, does not seem to comprehend that
> file block allocations may be discontiguous and may change over time
> depending on the file.

True. I may need to find a way to notify Xen of the underlying
changes, so that Xen can then adjust the address mapping.

Thanks,
Haozhong


[RFC KERNEL PATCH 1/2] nvdimm: add PFN_MODE_XEN to pfn device for Xen usage

2016-10-09 Thread Haozhong Zhang
A pfn device in PFN_MODE_XEN reserves an area for the Xen hypervisor
to place its own pmem management data structures (i.e. the frame table
and M2P table). The reserved area is neither used nor mapped by the
Linux kernel; only the data area is mapped.

Signed-off-by: Haozhong Zhang 
---
Cc: Dan Williams 
Cc: Ross Zwisler 
Cc: Andrew Morton 
Cc: Johannes Thumshirn 
Cc: linux-kernel@vger.kernel.org
---
 drivers/nvdimm/namespace_devs.c |  2 ++
 drivers/nvdimm/nd.h |  7 +++
 drivers/nvdimm/pfn_devs.c   | 37 +
 drivers/nvdimm/pmem.c   | 36 +---
 include/linux/pfn_t.h   |  2 ++
 5 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 3509cff..b1df653 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1394,6 +1394,8 @@ static ssize_t mode_show(struct device *dev,
claim = ndns->claim;
if (claim && is_nd_btt(claim))
mode = "safe";
+   else if (claim && is_nd_pfn_xen(claim))
+   mode = "xen";
else if (claim && is_nd_pfn(claim))
mode = "memory";
else if (claim && is_nd_dax(claim))
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index d3b2fca..6af3a78 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -192,6 +192,7 @@ enum nd_pfn_mode {
PFN_MODE_NONE,
PFN_MODE_RAM,
PFN_MODE_PMEM,
+   PFN_MODE_XEN,
 };
 
 struct nd_pfn {
@@ -272,6 +273,7 @@ struct nd_pfn *to_nd_pfn(struct device *dev);
 #if IS_ENABLED(CONFIG_NVDIMM_PFN)
 int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns);
 bool is_nd_pfn(struct device *dev);
+bool is_nd_pfn_xen(struct device *dev);
 struct device *nd_pfn_create(struct nd_region *nd_region);
 struct device *nd_pfn_devinit(struct nd_pfn *nd_pfn,
struct nd_namespace_common *ndns);
@@ -289,6 +291,11 @@ static inline bool is_nd_pfn(struct device *dev)
return false;
 }
 
+static inline bool is_nd_pfn_xen(struct device *dev)
+{
+   return false;
+}
+
 static inline struct device *nd_pfn_create(struct nd_region *nd_region)
 {
return NULL;
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index cea8350..6624f72 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -45,6 +45,12 @@ bool is_nd_pfn(struct device *dev)
 }
 EXPORT_SYMBOL(is_nd_pfn);
 
+bool is_nd_pfn_xen(struct device *dev)
+{
+   return is_nd_pfn(dev) ? to_nd_pfn(dev)->mode == PFN_MODE_XEN : false;
+}
+EXPORT_SYMBOL(is_nd_pfn_xen);
+
 struct nd_pfn *to_nd_pfn(struct device *dev)
 {
struct nd_pfn *nd_pfn = container_of(dev, struct nd_pfn, dev);
@@ -64,6 +70,8 @@ static ssize_t mode_show(struct device *dev,
return sprintf(buf, "ram\n");
case PFN_MODE_PMEM:
return sprintf(buf, "pmem\n");
+   case PFN_MODE_XEN:
+   return sprintf(buf, "xen\n");
default:
return sprintf(buf, "none\n");
}
@@ -88,6 +96,9 @@ static ssize_t mode_store(struct device *dev,
} else if (strncmp(buf, "ram\n", n) == 0
|| strncmp(buf, "ram", n) == 0)
nd_pfn->mode = PFN_MODE_RAM;
+   else if (strncmp(buf, "xen\n", n) == 0
+   || strncmp(buf, "xen", n) == 0)
+   nd_pfn->mode = PFN_MODE_XEN;
else if (strncmp(buf, "none\n", n) == 0
|| strncmp(buf, "none", n) == 0)
nd_pfn->mode = PFN_MODE_NONE;
@@ -383,6 +394,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
switch (le32_to_cpu(pfn_sb->mode)) {
case PFN_MODE_RAM:
case PFN_MODE_PMEM:
+   case PFN_MODE_XEN:
break;
default:
return -ENXIO;
@@ -532,11 +544,10 @@ static struct vmem_altmap *__nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
res->start += start_pad;
res->end -= end_trunc;
 
-   if (nd_pfn->mode == PFN_MODE_RAM) {
+   if (nd_pfn->mode == PFN_MODE_RAM || nd_pfn->mode == PFN_MODE_XEN) {
if (offset < SZ_8K)
return ERR_PTR(-EINVAL);
nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
-   altmap = NULL;
} else if (nd_pfn->mode == PFN_MODE_PMEM) {
nd_pfn->npfns = (resource_size(res) - offset) / PAGE_SIZE;
if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns)
@@ -544,11 +555,15 @@ static struct vmem_altmap *__nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
"numb

[RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-09 Thread Haozhong Zhang
Overview

This RFC kernel patch series along with corresponding patch series of
Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
NVDIMM devices to Xen HVM domU as vNVDIMM devices.

Xen hypervisor does not include an NVDIMM driver, so it needs the
assistance from the driver in Dom0 Linux kernel to manage NVDIMM
devices. We currently only support NVDIMM devices in pmem mode.

Design and Implementation
=
The complete design can be found at
  https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.

All patch series can be found at
  Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
  QEMU: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
  Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
  ndctl:https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1

Xen hypervisor needs assistance from the Dom0 Linux kernel for the following tasks:
1) Reserve an area on NVDIMM devices for Xen hypervisor to place
   memory management data structures, i.e. frame table and M2P table.
2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
   hypervisor.

For 1), Patch 1 implements a new mode PFN_MODE_XEN for pfn devices,
which makes the reservation for the Xen hypervisor.

For 2), Patch 2 uses a new Xen hypercall to report the address
information of pfn devices in PFN_MODE_XEN.

How to test
===
Please refer to the cover letter of Xen patch series
"[RFC XEN PATCH 00/16] Add vNVDIMM support to HVM domains".


Haozhong Zhang (2):
  nvdimm: add PFN_MODE_XEN to pfn device for Xen usage
  xen, nvdimm: report pfn devices in PFN_MODE_XEN to Xen hypervisor

 drivers/nvdimm/namespace_devs.c  |  2 ++
 drivers/nvdimm/nd.h  |  7 +
 drivers/nvdimm/pfn_devs.c| 37 +---
 drivers/nvdimm/pmem.c| 61 ++--
 drivers/xen/Makefile |  2 +-
 drivers/xen/pmem.c   | 53 ++
 include/linux/pfn_t.h|  2 ++
 include/xen/interface/platform.h | 13 +
 include/xen/pmem.h   | 32 +
 9 files changed, 201 insertions(+), 8 deletions(-)
 create mode 100644 drivers/xen/pmem.c
 create mode 100644 include/xen/pmem.h

-- 
2.10.1



[RFC KERNEL PATCH 2/2] xen, nvdimm: report pfn devices in PFN_MODE_XEN to Xen hypervisor

2016-10-09 Thread Haozhong Zhang
The Xen hypervisor does not include an NVDIMM driver and relies on the
driver in Dom0 Linux to probe pfn devices in PFN_MODE_XEN. Whenever
such a pfn device is probed, Dom0 Linux reports the pages of the
entire device, its reserved area, and its data area to the Xen
hypervisor.

Signed-off-by: Haozhong Zhang 
---
Cc: Ross Zwisler 
Cc: Dan Williams 
Cc: Boris Ostrovsky 
Cc: David Vrabel 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Arnd Bergmann 
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
---
 drivers/nvdimm/pmem.c| 25 +++
 drivers/xen/Makefile |  2 +-
 drivers/xen/pmem.c   | 53 
 include/xen/interface/platform.h | 13 ++
 include/xen/pmem.h   | 32 
 5 files changed, 124 insertions(+), 1 deletion(-)
 create mode 100644 drivers/xen/pmem.c
 create mode 100644 include/xen/pmem.h

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index d2c9ead..eab1ee4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -33,6 +33,11 @@
 #include "pfn.h"
 #include "nd.h"
 
+#ifdef CONFIG_XEN
+#include 
+#include 
+#endif
+
 static struct device *to_dev(struct pmem_device *pmem)
 {
/*
@@ -364,6 +369,26 @@ static int pmem_attach_disk(struct device *dev,
 
revalidate_disk(disk);
 
+#ifdef CONFIG_XEN
+   if (xen_initial_domain() && is_nd_pfn_xen(dev)) {
+   uint64_t rsv_off, rsv_size, data_off, data_size;
+   int err;
+
+   rsv_off = le32_to_cpu(pfn_sb->start_pad) +
+ PFN_PHYS(altmap->reserve);
+   rsv_size = PFN_PHYS(altmap->free);
+   data_off = le32_to_cpu(pfn_sb->start_pad) + pmem->data_offset;
+   data_size = pmem->size - pmem->pfn_pad - pmem->data_offset;
+
+   err = xen_pmem_add(pmem->phys_addr, pmem->size,
+  rsv_off, rsv_size, data_off, data_size);
+   if (err) {
+   dev_err(dev, "failed to register to Xen\n");
+   return err;
+   }
+   }
+#endif
+
return 0;
 }
 
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index 8feab810..7f95156 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -1,6 +1,6 @@
 obj-$(CONFIG_HOTPLUG_CPU)  += cpu_hotplug.o
 obj-$(CONFIG_X86)  += fallback.o
-obj-y  += grant-table.o features.o balloon.o manage.o preempt.o time.o
+obj-y  += grant-table.o features.o balloon.o manage.o preempt.o time.o pmem.o
 obj-y  += events/
 obj-y  += xenbus/
 
diff --git a/drivers/xen/pmem.c b/drivers/xen/pmem.c
new file mode 100644
index 000..bb027a5
--- /dev/null
+++ b/drivers/xen/pmem.c
@@ -0,0 +1,53 @@
+/**
+ * pmem.c
+ * pmem file for domain 0 kernel
+ *
+ * Copyright (c) 2016, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; If not, see <http://www.gnu.org/licenses/>.
+ *
+ * Author: Haozhong Zhang 
+ */
+
+#include 
+#include 
+#include 
+
+int xen_pmem_add(uint64_t spa, size_t size,
+uint64_t rsv_off, size_t rsv_size,
+uint64_t data_off, size_t data_size)
+{
+   int rc;
+   struct xen_platform_op op;
+
+   if ((spa | size | rsv_off | rsv_size | data_off | data_size) &
+   (PAGE_SIZE - 1))
+   return -EINVAL;
+
+   op.cmd = XENPF_pmem_add;
+   op.u.pmem_add.spfn = PHYS_PFN(spa);
+   op.u.pmem_add.epfn = PHYS_PFN(spa) + PHYS_PFN(size);
+   op.u.pmem_add.rsv_spfn = PHYS_PFN(spa + rsv_off);
+   op.u.pmem_add.rsv_epfn = PHYS_PFN(spa + rsv_off + rsv_size);
+   op.u.pmem_add.data_spfn = PHYS_PFN(spa + data_off);
+   op.u.pmem_add.data_epfn = PHYS_PFN(spa + data_off + data_size);
+
+   rc = HYPERVISOR_platform_op(&op);
+   if (rc)
+   pr_err("Xen pmem add failed on 0x%llx ~ 0x%llx, error: %d\n",
+  spa, spa + size, rc);
+
+   return rc;
+}
+EXPORT_SYMBOL(xen_pmem_add);
diff --git a/include/xen/interface/platform.h b/include/xen/interface/platform.h
index 732efb0..6c51f0c 100644
--- a/include/xen/interface/platform.h
+++ b/include/xen/interface/platform.h
@@ -500,6 +500,18 @@ struct xenpf_symdat

Re: [PATCH 2/2] KVM: x86: zero MSR_IA32_FEATURE_CONTROL on reset

2016-07-08 Thread Haozhong Zhang
On 07/08/16 14:01, Paolo Bonzini wrote:
> Reported-by: Laszlo Ersek 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/x86.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 388d9ffd7551..c01b0b3a06aa 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7494,6 +7494,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   if (!init_event) {
>   kvm_pmu_reset(vcpu);
>   vcpu->arch.smbase = 0x3;
> + vcpu->arch.msr_ia32_feature_control = 0;
>   }
>  
>   memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> -- 
> 1.8.3.1
> 

Actually I'm not sure whether zeroing MSR_IA32_FEATURE_CONTROL on
reset is the accurate behavior.

I know as Laszlo reported that Intel SDM says

  VMXON is also controlled by the IA32_FEATURE_CONTROL MSR (MSR
  address 3AH). This MSR is *cleared to zero* when a logical processor
  is reset

However, when the later section "ARCHITECTURAL MSRS" in Chapter
"MODEL-SPECIFIC REGISTERS (MSRS)" explains the Lock bit of
MSR_IA32_FEATURE_CONTROL in Table "IA-32 Architectural MSRs", it says

  ... once the Lock bit is set, the entire IA32_FEATURE_CONTROL
  contents are *preserved* across RESET when PWRGOOD is not
  deasserted.

This looks like a conflict. I'm asking my Intel colleagues for the
accurate behavior.

Thanks,
Haozhong


Re: [PATCH 1/2] KVM: x86: move MSR_IA32_FEATURE_CONTROL handling to x86.c

2016-07-08 Thread Haozhong Zhang
On 07/08/16 14:01, Paolo Bonzini wrote:
> Because the MSR is listed in msrs_to_save, it is exported to userspace
> for both AMD and Intel processors.  However, on AMD currently getting
> it will fail.
>

Is MSR_IA32_FEATURE_CONTROL already excluded from msrs_to_save[] by
kvm_init_msr_list() on AMD hosts?

Haozhong

> vmx_set_msr must keep the case label in order to handle the "exit
> nested on reset by writing 0" case.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/include/asm/kvm_host.h |  8 
>  arch/x86/kvm/vmx.c  | 41 
> ++---
>  arch/x86/kvm/x86.c  | 11 +++
>  arch/x86/kvm/x86.h  |  8 
>  4 files changed, 37 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b771667e8e99..f77e5f5a6b01 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -486,6 +486,14 @@ struct kvm_vcpu_arch {
>   u64 ia32_xss;
>  
>   /*
> +  * Only bits masked by msr_ia32_feature_control_valid_bits can be set in
> +  * msr_ia32_feature_control. FEATURE_CONTROL_LOCKED is always included
> +  * in msr_ia32_feature_control_valid_bits.
> +  */
> + u64 msr_ia32_feature_control;
> + u64 msr_ia32_feature_control_valid_bits;
> +
> + /*
>* Paging state of the vcpu
>*
>* If the vcpu runs in guest mode with two level paging this still saves
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 15793a4c5df0..939cd8b835c2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -611,14 +611,6 @@ struct vcpu_vmx {
>   bool guest_pkru_valid;
>   u32 guest_pkru;
>   u32 host_pkru;
> -
> - /*
> -  * Only bits masked by msr_ia32_feature_control_valid_bits can be set in
> -  * msr_ia32_feature_control. FEATURE_CONTROL_LOCKED is always included
> -  * in msr_ia32_feature_control_valid_bits.
> -  */
> - u64 msr_ia32_feature_control;
> - u64 msr_ia32_feature_control_valid_bits;
>  };
>  
>  enum segment_cache_field {
> @@ -2932,14 +2924,6 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
>   return 0;
>  }
>  
> -static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
> -  uint64_t val)
> -{
> - uint64_t valid_bits = to_vmx(vcpu)->msr_ia32_feature_control_valid_bits;
> -
> - return !(val & ~valid_bits);
> -}
> -
>  /*
>   * Reads an msr value (of 'msr_index') into 'pdata'.
>   * Returns 0 on success, non-0 otherwise.
> @@ -2983,14 +2967,11 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   break;
>   case MSR_IA32_MCG_EXT_CTL:
>   if (!msr_info->host_initiated &&
> - !(to_vmx(vcpu)->msr_ia32_feature_control &
> + !(vcpu->arch.msr_ia32_feature_control &
> FEATURE_CONTROL_LMCE))
>   return 1;
>   msr_info->data = vcpu->arch.mcg_ext_ctl;
>   break;
> - case MSR_IA32_FEATURE_CONTROL:
> - msr_info->data = to_vmx(vcpu)->msr_ia32_feature_control;
> - break;
>   case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
>   if (!nested_vmx_allowed(vcpu))
>   return 1;
> @@ -3081,18 +3062,18 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   break;
>   case MSR_IA32_MCG_EXT_CTL:
>   if ((!msr_info->host_initiated &&
> -  !(to_vmx(vcpu)->msr_ia32_feature_control &
> +  !(vcpu->arch.msr_ia32_feature_control &
>  FEATURE_CONTROL_LMCE)) ||
>   (data & ~MCG_EXT_CTL_LMCE_EN))
>   return 1;
>   vcpu->arch.mcg_ext_ctl = data;
>   break;
>   case MSR_IA32_FEATURE_CONTROL:
> - if (!vmx_feature_control_msr_valid(vcpu, data) ||
> - (to_vmx(vcpu)->msr_ia32_feature_control &
> + if (!feature_control_msr_valid(vcpu, data) ||
> + (vcpu->arch.msr_ia32_feature_control &
>FEATURE_CONTROL_LOCKED && !msr_info->host_initiated))
>   return 1;
> - vmx->msr_ia32_feature_control = data;
> + vcpu->arch.msr_ia32_feature_control = data;
>   if (msr_info->host_initiated && data == 0)
>   vmx_leave_nested(vcpu);
>   break;
> @@ -6964,7 +6945,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
>   return 1;
>   }
>  
> - if ((vmx->msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
> + if ((vcpu->arch.msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
>   != VMXON_NEEDED_FEATURES) {
>   kvm_inject_gp(vcpu, 0);
>   return 1;
> @@ -9081,8 +9062,6 @@ static struct k

Re: [PATCH v3 1/2] KVM: nVMX: Fix incorrect preemption timer vmexit in nested guest

2016-07-07 Thread Haozhong Zhang
On 07/07/16 15:07, Wanpeng Li wrote:
> 2016-07-07 15:02 GMT+08:00 Haozhong Zhang :
> > On 07/07/16 14:56, Wanpeng Li wrote:
> >> 2016-07-07 14:48 GMT+08:00 Haozhong Zhang :
> >> > On 07/07/16 11:46, Wanpeng Li wrote:
> >> >> From: Wanpeng Li 
> >> >>
> >> >> BUG: unable to handle kernel NULL pointer dereference at   
> >> >> (null)
> >> >> IP: [<  (null)>]   (null)
> >> >> PGD 0
> >> >> Oops: 0010 [#1] SMP
> >> >> Call Trace:
> >> >>  ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
> >> >>  handle_preemption_timer+0xe/0x20 [kvm_intel]
> >> >>  vmx_handle_exit+0x169/0x15a0 [kvm_intel]
> >> >>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
> >> >>  kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
> >> >>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
> >> >>  ? vcpu_load+0x1c/0x60 [kvm]
> >> >>  ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
> >> >>  kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
> >> >>  do_vfs_ioctl+0x96/0x6a0
> >> >>  ? __fget_light+0x2a/0x90
> >> >>  SyS_ioctl+0x79/0x90
> >> >>  do_syscall_64+0x68/0x180
> >> >>  entry_SYSCALL64_slow_path+0x25/0x25
> >> >> Code:  Bad RIP value.
> >> >> RIP  [<  (null)>]   (null)
> >> >>  RSP 
> >> >> CR2: 
> >> >> ---[ end trace 9c70c48b1a2bc66e ]---
> >> >>
> >> >> This can be readily reproduced with the preemption timer enabled on
> >> >> L0 and disabled on L1.
> >> >>
> >> >> Preemption timer for nested VMX is emulated by an hrtimer which is
> >> >> started on L2 entry, stopped on L2 exit and evaluated via the
> >> >> check_nested_events hook. However, nested_vmx_exit_handled always
> >> >> returns true for preemption timer vmexits, so an L1 preemption timer
> >> >> vmexit is captured and treated as an L2 preemption timer vmexit,
> >> >> incurring a nested vmexit that dereferences a NULL pointer.
> >> >>
> >> >> This patch fixes it by relying on check_nested_events to capture the
> >> >> L2 preemption timer (emulated hrtimer) expiry and the nested vmexit.
> >> >>
> >> >> Tested-by: Haozhong Zhang 
> >> >> Cc: Paolo Bonzini 
> >> >> Cc: Radim Krčmář 
> >> >> Cc: Yunhong Jiang 
> >> >> Cc: Jan Kiszka 
> >> >> Cc: Haozhong Zhang 
> >> >> Signed-off-by: Wanpeng Li 
> >> >> ---
> >> >> v2 -> v3:
> >> >>  * update patch subject
> >> >> v1 -> v2:
> >> >>  * fix typo in patch description
> >> >>
> >> >>  arch/x86/kvm/vmx.c | 2 ++
> >> >>  1 file changed, 2 insertions(+)
> >> >>
> >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> >> index 85e2f0a..29c16a8 100644
> >> >> --- a/arch/x86/kvm/vmx.c
> >> >> +++ b/arch/x86/kvm/vmx.c
> >> >> @@ -8041,6 +8041,8 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
> >> >>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_XSAVES);
> >> >>   case EXIT_REASON_PCOMMIT:
> >> >>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_PCOMMIT);
> >> >> + case EXIT_REASON_PREEMPTION_TIMER:
> >> >> + return false;
> >> >
> >> > If patch 2 can avoid accidentally enabling preemption timer in vmcs02,
> >> > will this one still be needed?
> >>
> >> After complete "L1 TSC deadline timer to trigger while L2 is running",
> >> L0's preemption timer fire when L2 is running can result in
> >> (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) be true, right?
> >>
> >
> > In prepare_vmcs02():
> >
> > exec_control &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
> > ...
> > vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control);
> >
> > so preemption timer will never be enabled while L2 guest is running.
> 
> It is not true after Paolo's patch.
>

OK, I didn't quite follow the discussion before v3, and just noticed
that Paolo's patch in the v2 comments enables the L1 preemption timer
while L2 is running. Then this one looks reasonable.

Thanks,
Haozhong


Re: [PATCH v3 1/2] KVM: nVMX: Fix incorrect preemption timer vmexit in nested guest

2016-07-07 Thread Haozhong Zhang
On 07/07/16 14:56, Wanpeng Li wrote:
> 2016-07-07 14:48 GMT+08:00 Haozhong Zhang :
> > On 07/07/16 11:46, Wanpeng Li wrote:
> >> From: Wanpeng Li 
> >>
> >> BUG: unable to handle kernel NULL pointer dereference at   (null)
> >> IP: [<  (null)>]   (null)
> >> PGD 0
> >> Oops: 0010 [#1] SMP
> >> Call Trace:
> >>  ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
> >>  handle_preemption_timer+0xe/0x20 [kvm_intel]
> >>  vmx_handle_exit+0x169/0x15a0 [kvm_intel]
> >>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
> >>  kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
> >>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
> >>  ? vcpu_load+0x1c/0x60 [kvm]
> >>  ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
> >>  kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
> >>  do_vfs_ioctl+0x96/0x6a0
> >>  ? __fget_light+0x2a/0x90
> >>  SyS_ioctl+0x79/0x90
> >>  do_syscall_64+0x68/0x180
> >>  entry_SYSCALL64_slow_path+0x25/0x25
> >> Code:  Bad RIP value.
> >> RIP  [<  (null)>]   (null)
> >>  RSP 
> >> CR2: 
> >> ---[ end trace 9c70c48b1a2bc66e ]---
> >>
> >> This can be readily reproduced with the preemption timer enabled on
> >> L0 and disabled on L1.
> >>
> >> Preemption timer for nested VMX is emulated by an hrtimer which is
> >> started on L2 entry, stopped on L2 exit and evaluated via the
> >> check_nested_events hook. However, nested_vmx_exit_handled always
> >> returns true for preemption timer vmexits, so an L1 preemption timer
> >> vmexit is captured and treated as an L2 preemption timer vmexit,
> >> incurring a nested vmexit that dereferences a NULL pointer.
> >>
> >> This patch fixes it by relying on check_nested_events to capture the
> >> L2 preemption timer (emulated hrtimer) expiry and the nested vmexit.
> >>
> >> Tested-by: Haozhong Zhang 
> >> Cc: Paolo Bonzini 
> >> Cc: Radim Krčmář 
> >> Cc: Yunhong Jiang 
> >> Cc: Jan Kiszka 
> >> Cc: Haozhong Zhang 
> >> Signed-off-by: Wanpeng Li 
> >> ---
> >> v2 -> v3:
> >>  * update patch subject
> >> v1 -> v2:
> >>  * fix typo in patch description
> >>
> >>  arch/x86/kvm/vmx.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 85e2f0a..29c16a8 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -8041,6 +8041,8 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
> >>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_XSAVES);
> >>   case EXIT_REASON_PCOMMIT:
> >>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_PCOMMIT);
> >> + case EXIT_REASON_PREEMPTION_TIMER:
> >> + return false;
> >
> > If patch 2 can avoid accidentally enabling preemption timer in vmcs02,
> > will this one still be needed?
> 
> After complete "L1 TSC deadline timer to trigger while L2 is running",
> L0's preemption timer fire when L2 is running can result in
> (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) be true, right?
>

In prepare_vmcs02():

exec_control &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
...
vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control);

so preemption timer will never be enabled while L2 guest is running.

Haozhong


Re: [PATCH v3 1/2] KVM: nVMX: Fix incorrect preemption timer vmexit in nested guest

2016-07-06 Thread Haozhong Zhang
On 07/07/16 11:46, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> BUG: unable to handle kernel NULL pointer dereference at   (null)
> IP: [<  (null)>]   (null)
> PGD 0
> Oops: 0010 [#1] SMP
> Call Trace:
>  ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
>  handle_preemption_timer+0xe/0x20 [kvm_intel]
>  vmx_handle_exit+0x169/0x15a0 [kvm_intel]
>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
>  kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
>  ? vcpu_load+0x1c/0x60 [kvm]
>  ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
>  kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
>  do_vfs_ioctl+0x96/0x6a0
>  ? __fget_light+0x2a/0x90
>  SyS_ioctl+0x79/0x90
>  do_syscall_64+0x68/0x180
>  entry_SYSCALL64_slow_path+0x25/0x25
> Code:  Bad RIP value.
> RIP  [<  (null)>]   (null)
>  RSP 
> CR2: 
> ---[ end trace 9c70c48b1a2bc66e ]---
> 
> This can be readily reproduced with the preemption timer enabled on L0
> and disabled on L1.
> 
> Preemption timer for nested VMX is emulated by an hrtimer which is
> started on L2 entry, stopped on L2 exit and evaluated via the
> check_nested_events hook. However, nested_vmx_exit_handled always
> returns true for preemption timer vmexits, so an L1 preemption timer
> vmexit is captured and treated as an L2 preemption timer vmexit,
> incurring a nested vmexit that dereferences a NULL pointer.
> 
> This patch fixes it by relying on check_nested_events to capture the
> L2 preemption timer (emulated hrtimer) expiry and the nested vmexit.
> 
> Tested-by: Haozhong Zhang 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Yunhong Jiang 
> Cc: Jan Kiszka 
> Cc: Haozhong Zhang 
> Signed-off-by: Wanpeng Li 
> ---
> v2 -> v3:
>  * update patch subject
> v1 -> v2:
>  * fix typo in patch description
> 
>  arch/x86/kvm/vmx.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 85e2f0a..29c16a8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -8041,6 +8041,8 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_XSAVES);
>   case EXIT_REASON_PCOMMIT:
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_PCOMMIT);
> + case EXIT_REASON_PREEMPTION_TIMER:
> + return false;

If patch 2 can avoid accidentally enabling preemption timer in vmcs02,
will this one still be needed?

Haozhong


>   default:
>   return true;
>   }
> -- 
> 1.9.1
> 


Re: [PATCH v3] KVM: nVMX: Fix incorrect preemption timer vmexit in nested guest

2016-07-06 Thread Haozhong Zhang
On 07/06/16 15:32, Paolo Bonzini wrote:
> 
> 
> On 06/07/2016 15:26, Haozhong Zhang wrote:
> > On 07/06/16 19:42, Wanpeng Li wrote:
> >> From: Wanpeng Li 
> >>
> >> BUG: unable to handle kernel NULL pointer dereference at   (null)
> >> IP: [<  (null)>]   (null)
> >> PGD 0 
> >> Oops: 0010 [#1] SMP
> >> Call Trace:
> >>  ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
> >>  handle_preemption_timer+0xe/0x20 [kvm_intel]
> >>  vmx_handle_exit+0x169/0x15a0 [kvm_intel]
> >>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
> >>  kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
> >>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
> >>  ? vcpu_load+0x1c/0x60 [kvm]
> >>  ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
> >>  kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
> >>  do_vfs_ioctl+0x96/0x6a0
> >>  ? __fget_light+0x2a/0x90
> >>  SyS_ioctl+0x79/0x90
> >>  do_syscall_64+0x68/0x180
> >>  entry_SYSCALL64_slow_path+0x25/0x25
> >> Code:  Bad RIP value.
> >> RIP  [<  (null)>]   (null)
> >>  RSP 
> >> CR2: 
> >> ---[ end trace 9c70c48b1a2bc66e ]---
> >>
> >> This can be readily reproduced with the preemption timer enabled on
> >> L0 and disabled on L1.
> >>
> >> Preemption timer for nested VMX is emulated by an hrtimer which is
> >> started on L2 entry, stopped on L2 exit and evaluated via the
> >> check_nested_events hook. However, nested_vmx_exit_handled always
> >> returns true for preemption timer vmexits, so an L1 preemption timer
> >> vmexit is captured and treated as an L2 preemption timer vmexit,
> >> incurring a nested vmexit that dereferences a NULL pointer.
> >>
> >> This patch fixes it by relying on check_nested_events to capture the
> >> L2 preemption timer (emulated hrtimer) expiry and the nested vmexit.
> >>
> >> Cc: Paolo Bonzini 
> >> Cc: Radim Krčmář 
> >> Cc: Yunhong Jiang 
> >> Cc: Jan Kiszka 
> >> Cc: Haozhong Zhang 
> >> Signed-off-by: Wanpeng Li 
> >> ---
> >> v2 -> v3:
> >>  * update patch subject
> >> v1 -> v2:
> >>  * fix typo in patch description
> >>
> >>  arch/x86/kvm/vmx.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 85e2f0a..29c16a8 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -8041,6 +8041,8 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
> >>return nested_cpu_has2(vmcs12, SECONDARY_EXEC_XSAVES);
> >>case EXIT_REASON_PCOMMIT:
> >>return nested_cpu_has2(vmcs12, SECONDARY_EXEC_PCOMMIT);
> >> +  case EXIT_REASON_PREEMPTION_TIMER:
> >> +  return false;
> >>default:
> >>return true;
> >>}
> >> -- 
> >> 1.9.1
> >>
> > 
> > This patch also fixed the crash of L1 Xen with an L2 HVM guest. Xen does
> > not enable the preemption timer for HVM guests, and will panic if it
> > receives a preemption timer vmexit.
> 
> Thanks!  I'm still not sure why the bit is set in the vmcs02 though...
> 

Yes, it looks really weird.

I replaced "return false" in Wanpeng's patch by

pr_info("VMCS: preemption timer enabled = %d\n",
!!(vmcs_read32(PIN_BASED_VM_EXEC_CONTROL) & 
PIN_BASED_VMX_PREEMPTION_TIMER));

and redid my test. As expected, L1 Xen crashed due to the unexpected
preemption timer vmexit. I got a log from above statement just before crash:

VMCS: preemption timer enabled = 1

which is expected to be 0, because preemption timer is disabled in
vmcs02. I also modified L1 Xen to dump VMCS at crash, and it says
preemption timer is disabled.

I noticed Jim Mattson recently sent a patch "KVM: nVMX: Fix memory
corruption when using VMCS shadowing" to fix the inconsistency between
vmcs12 and its shadow. Is it relevant here? I'll test his patch
tomorrow.

Thanks,
Haozhong


Re: [PATCH v3] KVM: nVMX: Fix incorrect preemption timer vmexit in nested guest

2016-07-06 Thread Haozhong Zhang
On 07/06/16 19:42, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> BUG: unable to handle kernel NULL pointer dereference at   (null)
> IP: [<  (null)>]   (null)
> PGD 0 
> Oops: 0010 [#1] SMP
> Call Trace:
>  ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
>  handle_preemption_timer+0xe/0x20 [kvm_intel]
>  vmx_handle_exit+0x169/0x15a0 [kvm_intel]
>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
>  kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
>  ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
>  ? vcpu_load+0x1c/0x60 [kvm]
>  ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
>  kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
>  do_vfs_ioctl+0x96/0x6a0
>  ? __fget_light+0x2a/0x90
>  SyS_ioctl+0x79/0x90
>  do_syscall_64+0x68/0x180
>  entry_SYSCALL64_slow_path+0x25/0x25
> Code:  Bad RIP value.
> RIP  [<  (null)>]   (null)
>  RSP 
> CR2: 
> ---[ end trace 9c70c48b1a2bc66e ]---
> 
> This can be readily reproduced with the preemption timer enabled on L0
> and disabled on L1.
> 
> Preemption timer for nested VMX is emulated by an hrtimer which is
> started on L2 entry, stopped on L2 exit and evaluated via the
> check_nested_events hook. However, nested_vmx_exit_handled always
> returns true for preemption timer vmexits, so an L1 preemption timer
> vmexit is captured and treated as an L2 preemption timer vmexit,
> incurring a nested vmexit that dereferences a NULL pointer.
> 
> This patch fixes it by relying on check_nested_events to capture the
> L2 preemption timer (emulated hrtimer) expiry and the nested vmexit.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Yunhong Jiang 
> Cc: Jan Kiszka 
> Cc: Haozhong Zhang 
> Signed-off-by: Wanpeng Li 
> ---
> v2 -> v3:
>  * update patch subject
> v1 -> v2:
>  * fix typo in patch description
> 
>  arch/x86/kvm/vmx.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 85e2f0a..29c16a8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -8041,6 +8041,8 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_XSAVES);
>   case EXIT_REASON_PCOMMIT:
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_PCOMMIT);
> + case EXIT_REASON_PREEMPTION_TIMER:
> + return false;
>   default:
>   return true;
>   }
> -- 
> 1.9.1
> 

This patch also fixed the crash of L1 Xen with an L2 HVM guest. Xen
does not enable the preemption timer for HVM guests, and will panic if
it receives a preemption timer vmexit.

Tested-by: Haozhong Zhang 


Re: [PATCH] KVM: VMX: switch to hrtimer for TSC deadline timer when L2 guest is running

2016-07-06 Thread Haozhong Zhang
On 07/06/16 04:14, Paolo Bonzini wrote:
> > On 07/06/16 16:01, Haozhong Zhang wrote:
> > > On 07/06/16 09:46, Paolo Bonzini wrote:
> > > > On 06/07/2016 08:05, Haozhong Zhang wrote:
> > > > >> > Nested preemption timer is emulated by hrtimer, so it doesn't
> > > > >> > influence vmcs02, why this is needed?
> > > > > Nested (L2) preemption timer is not affected for sure, and this patch
> > > > > is to fix another problem caused by using L1 preemption timer for L1
> > > > > TSC deadline timer. When we use L1 VMX preemption timer for L1 TSC
> > > > > deadline timer, we intend to use the pin-based exec config and the VMX
> > > > > preemption timer value in vmcs01. However, when L2 guest is running,
> > > > > vmcs02 is loaded as the current VMCS so that a different VMX
> > > > > preemption timer config is used (i.e. VMX preemption timer is disabled
> > > > > in prepare_vmcs02()). If we still use preemption timer for L1 TSC
> > > > > deadline timer at this moment, then the L1 TSC deadline timer will
> > > > > not fire while the L2 guest is running.
> > > > 
> > > > The right fix then is to set the pin-based controls in prepare_vmcs02,
> > > > based on vcpu->arch.hv_deadline_tsc.
> > > 
> > > Then KVM needs to distinguish L1 and L2 VMEXITs for preemption timer
> > > (or does KVM already have this?). The current implementation which
> > > uses a separate hrtimer for L2 preemption timer can easily distinguish
> > > them.
> > 
> > And an L1 hypervisor may also want to use the preemption timer for an
> > L2 guest. In this case, L0 KVM cannot reuse the preemption timer for
> > both L1 and L2.
> 
> The vmcs12's preemption timer is emulated by KVM through a separate hrtimer,
> so KVM would only use the vmcs02 preemption timer for the L1 APIC timer.

Ah, I got what you mean. Please drop this patch, and I'll send another
one later.

Thanks,
Haozhong


Re: [PATCH] KVM: VMX: switch to hrtimer for TSC deadline timer when L2 guest is running

2016-07-06 Thread Haozhong Zhang
On 07/06/16 16:01, Haozhong Zhang wrote:
> On 07/06/16 09:46, Paolo Bonzini wrote:
> > 
> > 
> > On 06/07/2016 08:05, Haozhong Zhang wrote:
> > >> > 
> > >> > Nested preemption timer is emulated by hrtimer, so it doesn't
> > >> > influence vmcs02, why this is needed?
> > > Nested (L2) preemption timer is not affected for sure, and this patch
> > > is to fix another problem caused by using L1 preemption timer for L1
> > > TSC deadline timer. When we use L1 VMX preemption timer for L1 TSC
> > > deadline timer, we intend to use the pin-based exec config and the VMX
> > > preemption timer value in vmcs01. However, when L2 guest is running,
> > > vmcs02 is loaded as the current VMCS so that a different VMX
> > > preemption timer config is used (i.e. VMX preemption timer is disabled
> > > in prepare_vmcs02()). If we still use preemption timer for L1 TSC
> > > deadline timer at this moment, then the L1 TSC deadline timer will
> > > not fire while the L2 guest is running.
> > 
> > The right fix then is to set the pin-based controls in prepare_vmcs02,
> > based on vcpu->arch.hv_deadline_tsc.
> 
> Then KVM needs to distinguish L1 and L2 VMEXITs for preemption timer
> (or does KVM already have this?). The current implementation which
> uses a separate hrtimer for L2 preemption timer can easily distinguish
> them.
> 

And an L1 hypervisor may also want to use the preemption timer for an
L2 guest. In this case, L0 KVM cannot reuse the preemption timer for
both L1 and L2.

Haozhong


Re: [PATCH] KVM: VMX: switch to hrtimer for TSC deadline timer when L2 guest is running

2016-07-06 Thread Haozhong Zhang
On 07/06/16 09:46, Paolo Bonzini wrote:
> 
> 
> On 06/07/2016 08:05, Haozhong Zhang wrote:
> >> > 
> >> > Nested preemption timer is emulated by hrtimer, so it doesn't
> >> > influence vmcs02, why this is needed?
> > Nested (L2) preemption timer is not affected for sure, and this patch
> > is to fix another problem caused by using L1 preemption timer for L1
> > TSC deadline timer. When we use L1 VMX preemption timer for L1 TSC
> > deadline timer, we intend to use the pin-based exec config and the VMX
> > preemption timer value in vmcs01. However, when L2 guest is running,
> > vmcs02 is loaded as the current VMCS so that a different VMX
> > preemption timer config is used (i.e. VMX preemption timer is disabled
> > in prepare_vmcs02()). If we still use preemption timer for L1 TSC
> > deadline timer at this moment, then the L1 TSC deadline timer cannot
> > be triggered while the L2 guest is running.
> 
> The right fix then is to set the pin-based controls in prepare_vmcs02,
> based on vcpu->arch.hv_deadline_tsc.

Then KVM needs to distinguish L1 and L2 VMEXITs for preemption timer
(or does KVM already have this?). The current implementation which
uses a separate hrtimer for L2 preemption timer can easily distinguish
them.

Haozhong


Re: [PATCH] KVM: VMX: switch to hrtimer for TSC deadline timer when L2 guest is running

2016-07-05 Thread Haozhong Zhang
On 07/06/16 13:37, Wanpeng Li wrote:
> Cc Jan,
> 2016-07-06 13:10 GMT+08:00 Haozhong Zhang :
> > A different VMCS is loaded when L2 guest is running, so it's incorrect
> > to use the VMX preemption timer for L1 TSC deadline timer. This patch
> > switches to hrtimer for L1 TSC deadline timer when entering L2 guest,
> > and switches back to VMX preemption timer when nested VMEXIT from L2 to
> > L1.
> 
> Nested preemption timer is emulated by hrtimer, so it doesn't
> influence vmcs02, why this is needed?

Nested (L2) preemption timer is not affected for sure, and this patch
is to fix another problem caused by using L1 preemption timer for L1
TSC deadline timer. When we use L1 VMX preemption timer for L1 TSC
deadline timer, we intend to use the pin-based exec config and the VMX
preemption timer value in vmcs01. However, when L2 guest is running,
vmcs02 is loaded as the current VMCS so that a different VMX
preemption timer config is used (i.e. VMX preemption timer is disabled
in prepare_vmcs02()). If we still use preemption timer for L1 TSC
deadline timer at this moment, then the L1 TSC deadline timer cannot
be triggered while the L2 guest is running.

Thanks,
Haozhong


[PATCH] KVM: VMX: switch to hrtimer for TSC deadline timer when L2 guest is running

2016-07-05 Thread Haozhong Zhang
A different VMCS is loaded when L2 guest is running, so it's incorrect
to use the VMX preemption timer for L1 TSC deadline timer. This patch
switches to hrtimer for L1 TSC deadline timer when entering L2 guest,
and switches back to VMX preemption timer when nested VMEXIT from L2 to
L1.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/kvm/vmx.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 85e2f0a..cc29c2a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10203,6 +10203,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
if (!vmcs02)
return -ENOMEM;
 
+   if (kvm_lapic_hv_timer_in_use(vcpu))
+   kvm_lapic_switch_to_sw_timer(vcpu);
+
enter_guest_mode(vcpu);
 
vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
@@ -10227,6 +10230,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
if (msr_entry_idx) {
leave_guest_mode(vcpu);
vmx_load_vmcs01(vcpu);
+   kvm_lapic_switch_to_hv_timer(vcpu);
nested_vmx_entry_failure(vcpu, vmcs12,
EXIT_REASON_MSR_LOAD_FAIL, msr_entry_idx);
return 1;
@@ -10700,6 +10704,7 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
nested_vmx_abort(vcpu, VMX_ABORT_SAVE_GUEST_MSR_FAIL);
 
vmx_load_vmcs01(vcpu);
+   kvm_lapic_switch_to_hv_timer(vcpu);
 
if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
&& nested_exit_intr_ack_set(vcpu)) {
-- 
2.9.0
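
As background for the switch above: kvm_lapic_switch_to_sw_timer()
re-arms the APIC timer on a Linux hrtimer while vmcs02 is active. Below
is a minimal sketch of that hrtimer pattern; it is illustrative only,
not the actual KVM code, it assumes a kernel context, and every name in
it is hypothetical.

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer sw_deadline_timer;

/* Called in hrtimer context when the deadline hits; this is where
 * KVM would inject the expired APIC timer into the vcpu. */
static enum hrtimer_restart sw_deadline_expired(struct hrtimer *timer)
{
	return HRTIMER_NORESTART;
}

/* Arm a one-shot timer delta_ns nanoseconds from now. */
static void sw_deadline_arm(u64 delta_ns)
{
	hrtimer_init(&sw_deadline_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	sw_deadline_timer.function = sw_deadline_expired;
	hrtimer_start(&sw_deadline_timer, ns_to_ktime(delta_ns),
		      HRTIMER_MODE_REL);
}

/* Cancel it again when switching back to the VMX preemption timer. */
static void sw_deadline_cancel(void)
{
	hrtimer_cancel(&sw_deadline_timer);
}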



[PATCH v3 2/3] KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL

2016-06-22 Thread Haozhong Zhang
KVM currently does not check the value written to the guest
MSR_IA32_FEATURE_CONTROL, so bits corresponding to disabled features
may be set. This patch makes KVM validate the individual bits written to
the guest MSR_IA32_FEATURE_CONTROL according to the enabled features.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/kvm/vmx.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ad66978..0a3ccb0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -612,7 +612,13 @@ struct vcpu_vmx {
u32 guest_pkru;
u32 host_pkru;
 
+   /*
+* Only bits masked by msr_ia32_feature_control_valid_bits can be set in
+* msr_ia32_feature_control. FEATURE_CONTROL_LOCKED is always included
+* in msr_ia32_feature_control_valid_bits.
+*/
u64 msr_ia32_feature_control;
+   u64 msr_ia32_feature_control_valid_bits;
 };
 
 enum segment_cache_field {
@@ -2929,6 +2935,14 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
return 0;
 }
 
+static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
+uint64_t val)
+{
+   uint64_t valid_bits = to_vmx(vcpu)->msr_ia32_feature_control_valid_bits;
+
+   return !(val & ~valid_bits);
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
@@ -3062,7 +3076,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
ret = kvm_set_msr_common(vcpu, msr_info);
break;
case MSR_IA32_FEATURE_CONTROL:
-   if (!nested_vmx_allowed(vcpu) ||
+   if (!vmx_feature_control_msr_valid(vcpu, data) ||
(to_vmx(vcpu)->msr_ia32_feature_control &
 FEATURE_CONTROL_LOCKED && !msr_info->host_initiated))
return 1;
@@ -9055,6 +9069,8 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
goto free_vmcs;
}
 
+   vmx->msr_ia32_feature_control_valid_bits = FEATURE_CONTROL_LOCKED;
+
return &vmx->vcpu;
 
 free_vmcs:
@@ -9202,6 +9218,13 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
vmx->nested.nested_vmx_secondary_ctls_high &=
~SECONDARY_EXEC_PCOMMIT;
}
+
+   if (nested_vmx_allowed(vcpu))
+   to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
+   FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
+   else
+   to_vmx(vcpu)->msr_ia32_feature_control_valid_bits &=
+   ~FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
 }
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
-- 
2.9.0
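
The validation introduced above is a whitelist pattern: a write is
rejected iff it sets any bit outside the per-vcpu valid mask, and the
mask is updated whenever a feature is exposed or hidden. Below is a
self-contained userspace illustration of the same pattern; the two bit
positions match the real MSR layout, everything else is hypothetical.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define FEATURE_CONTROL_LOCKED                    (1ULL << 0)
#define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX (1ULL << 2)

/* A write is valid iff it sets no bit outside the valid mask. */
static bool write_is_valid(uint64_t val, uint64_t valid_bits)
{
	return !(val & ~valid_bits);
}

int main(void)
{
	/* The locked bit is always writable; the VMXON bit is valid only
	 * when nested VMX is allowed for this vcpu. */
	uint64_t valid = FEATURE_CONTROL_LOCKED |
			 FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;

	printf("%d\n", write_is_valid(0x5, valid)); /* 1: bits 0 and 2 */
	printf("%d\n", write_is_valid(0x8, valid)); /* 0: bit 3 is reserved */
	return 0;
}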



[PATCH v3 3/3] KVM: VMX: enable guest access to LMCE related MSRs

2016-06-22 Thread Haozhong Zhang
From: Ashok Raj 

On Intel platforms, this patch adds LMCE to KVM MCE supported
capabilities and handles guest access to LMCE related MSRs.

Signed-off-by: Ashok Raj 
[Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported
   Only enable LMCE on Intel platform
   Check MSR_IA32_FEATURE_CONTROL when handling guest
 access to MSR_IA32_MCG_EXT_CTL]
Signed-off-by: Haozhong Zhang 
---
 arch/x86/include/asm/kvm_host.h |  5 +
 arch/x86/kvm/vmx.c  | 29 +
 arch/x86/kvm/x86.c  | 15 +--
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 360c517..7a628fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -598,6 +598,7 @@ struct kvm_vcpu_arch {
u64 mcg_cap;
u64 mcg_status;
u64 mcg_ctl;
+   u64 mcg_ext_ctl;
u64 *mce_banks;
 
/* Cache MMIO info */
@@ -1008,6 +1009,8 @@ struct kvm_x86_ops {
 
int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc);
void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);
+
+   void (*setup_mce)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
@@ -1082,6 +1085,8 @@ extern u64  kvm_max_tsc_scaling_ratio;
 /* 1ull << kvm_tsc_scaling_ratio_frac_bits */
 extern u64  kvm_default_tsc_scaling_ratio;
 
+extern u64 kvm_mce_cap_supported;
+
 enum emulation_result {
EMULATE_DONE, /* no further processing */
EMULATE_USER_EXIT,/* kvm_run ready for userspace exit */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0a3ccb0..943609f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2984,6 +2984,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
msr_info->data = vmcs_read64(GUEST_BNDCFGS);
break;
+   case MSR_IA32_MCG_EXT_CTL:
+   if (!msr_info->host_initiated &&
+   !(to_vmx(vcpu)->msr_ia32_feature_control &
+ FEATURE_CONTROL_LMCE))
+   return 1;
+   msr_info->data = vcpu->arch.mcg_ext_ctl;
+   break;
case MSR_IA32_FEATURE_CONTROL:
msr_info->data = to_vmx(vcpu)->msr_ia32_feature_control;
break;
@@ -3075,6 +3082,14 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC_ADJUST:
ret = kvm_set_msr_common(vcpu, msr_info);
break;
+   case MSR_IA32_MCG_EXT_CTL:
+   if ((!msr_info->host_initiated &&
+!(to_vmx(vcpu)->msr_ia32_feature_control &
+  FEATURE_CONTROL_LMCE)) ||
+   (data & ~MCG_EXT_CTL_LMCE_EN))
+   return 1;
+   vcpu->arch.mcg_ext_ctl = data;
+   break;
case MSR_IA32_FEATURE_CONTROL:
if (!vmx_feature_control_msr_valid(vcpu, data) ||
(to_vmx(vcpu)->msr_ia32_feature_control &
@@ -6484,6 +6499,8 @@ static __init int hardware_setup(void)
 
kvm_set_posted_intr_wakeup_handler(wakeup_handler);
 
+   kvm_mce_cap_supported |= MCG_LMCE_P;
+
return alloc_kvm_area();
 
 out8:
@@ -11109,6 +11126,16 @@ out:
return ret;
 }
 
+static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->arch.mcg_cap & MCG_LMCE_P)
+   to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
+   FEATURE_CONTROL_LMCE;
+   else
+   to_vmx(vcpu)->msr_ia32_feature_control_valid_bits &=
+   ~FEATURE_CONTROL_LMCE;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -11238,6 +11265,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
.set_hv_timer = vmx_set_hv_timer,
.cancel_hv_timer = vmx_cancel_hv_timer,
 #endif
+
+   .setup_mce = vmx_setup_mce,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2992196..0a42fc7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -70,7 +70,8 @@
 
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
-#define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
+u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;
+EXPORT_SYMBOL_GPL(kvm_mce_cap_supported);
 
 #define emul_to_vcpu(ctxt) \
container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt)
@@ -984,6 +985,7 @@ static u32 emulated_msrs[] = {
MSR_IA32_MISC_ENABLE,
MSR_IA32_MCG_STATUS,
MSR_IA32_MCG_CTL,
+   MSR_IA32_MCG_EXT_CTL,
MSR_IA32_SMBASE,
 };
 
@@ -2685,11 +2687,9 @@ long kvm_arch_dev_ioctl(struct file *filp,
   

[PATCH v3 0/3] Add KVM support for Intel local MCE

2016-06-22 Thread Haozhong Zhang
Changes in v3:
 * Make guest MSR_IA32_FEATURE_CONTROL always available (Paolo
   Bonzini) and remove the nested vmx check in the 'get' case in patch
   1 (Borislav Petkov).
 * Always mark the locked bit of MSR_IA32_FEATURE_CONTROL
   valid. (Paolo Bonzini)
 * Remove the now unnecessary macros to set
   msr_ia32_feature_control_valid_bits. (Paolo Bonzini)
 * Remove the unnecessary check of MCG_LMCE_P in v2
   vmx_mcg_ext_ctl_msr_present() and inline the remaining part. (Paolo
   Bonzini)

Changes in v2:
 * v1 Patch 1 becomes v2 Patch 3.
 * Fix SOB chain in Patch 3. (Boris Petkov)
 * (Patch 1) Move msr_ia32_feature_control from nested_vmx to
   vcpu_vmx, because it does not depend only on nested after this
   patch series. (Radim Krčmář)
 * (Patch 2) Add a valid bitmask for MSR_IA32_FEATURE_CONTROL to allow
   checking individual bits of MSR_IA32_FEATURE_CONTROL according to
   enabled features. (Radim Krčmář)
 * Move the common check in handling MSR_IA32_MCG_EXT_CTL to function
   vmx_mcg_ext_ctl_msr_present. (Radim Krčmář)

Changes in v1:
 * Change macro KVM_MCE_CAP_SUPPORTED to variable kvm_mce_cap_supported.
 * Include LMCE capability in kvm_mce_cap_supported only on Intel CPU,
   i.e. LMCE can be enabled only on Intel CPU.
 * Check if LMCE is enabled in guest MSR_IA32_FEATURE_CONTROL when
   handling guest access to MSR_IA32_MCG_EXT_CTL.

This patch series, along with the corresponding QEMU patch series (sent
via another email with the title "[PATCH v5 0/4] Add QEMU support for
Intel local MCE"), enables the Intel local MCE feature for guests. The
KVM series handles guest access to the LMCE-related MSRs
(MSR_IA32_MCG_EXT_CTL and MSR_IA32_FEATURE_CONTROL).

Ashok Raj (1):
  KVM: VMX: enable guest access to LMCE related MSRs

Haozhong Zhang (2):
  KVM: VMX: move msr_ia32_feature_control to vcpu_vmx
  KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL

 arch/x86/include/asm/kvm_host.h |  5 +++
 arch/x86/kvm/vmx.c  | 67 -
 arch/x86/kvm/x86.c  | 15 +
 3 files changed, 73 insertions(+), 14 deletions(-)

-- 
2.9.0



[PATCH v3 1/3] KVM: VMX: move msr_ia32_feature_control to vcpu_vmx

2016-06-22 Thread Haozhong Zhang
msr_ia32_feature_control will be used for LMCE and not depend only on
nested anymore, so move it from struct nested_vmx to struct vcpu_vmx.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/kvm/vmx.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e185649..ad66978 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -428,7 +428,6 @@ struct nested_vmx {
struct pi_desc *pi_desc;
bool pi_pending;
u16 posted_intr_nv;
-   u64 msr_ia32_feature_control;
 
struct hrtimer preemption_timer;
bool preemption_timer_expired;
@@ -612,6 +611,8 @@ struct vcpu_vmx {
bool guest_pkru_valid;
u32 guest_pkru;
u32 host_pkru;
+
+   u64 msr_ia32_feature_control;
 };
 
 enum segment_cache_field {
@@ -2970,9 +2971,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vmcs_read64(GUEST_BNDCFGS);
break;
case MSR_IA32_FEATURE_CONTROL:
-   if (!nested_vmx_allowed(vcpu))
-   return 1;
-   msr_info->data = to_vmx(vcpu)->nested.msr_ia32_feature_control;
+   msr_info->data = to_vmx(vcpu)->msr_ia32_feature_control;
break;
case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
if (!nested_vmx_allowed(vcpu))
@@ -3064,10 +3063,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
case MSR_IA32_FEATURE_CONTROL:
if (!nested_vmx_allowed(vcpu) ||
-   (to_vmx(vcpu)->nested.msr_ia32_feature_control &
+   (to_vmx(vcpu)->msr_ia32_feature_control &
 FEATURE_CONTROL_LOCKED && !msr_info->host_initiated))
return 1;
-   vmx->nested.msr_ia32_feature_control = data;
+   vmx->msr_ia32_feature_control = data;
if (msr_info->host_initiated && data == 0)
vmx_leave_nested(vcpu);
break;
@@ -6939,7 +6938,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
return 1;
}
 
-   if ((vmx->nested.msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
+   if ((vmx->msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
!= VMXON_NEEDED_FEATURES) {
kvm_inject_gp(vcpu, 0);
return 1;
-- 
2.9.0



Re: [PATCH v2 3/3] KVM: VMX: enable guest access to LMCE related MSRs

2016-06-19 Thread Haozhong Zhang
On 06/17/16 14:15, Eduardo Habkost wrote:
> On Fri, Jun 17, 2016 at 09:11:16AM +0800, Haozhong Zhang wrote:
> > On 06/16/16 11:55, Eduardo Habkost wrote:
> > > On Thu, Jun 16, 2016 at 12:04:50PM +0200, Paolo Bonzini wrote:
> > > > On 16/06/2016 08:05, Haozhong Zhang wrote:
> > > > > From: Ashok Raj 
> > > > > 
> > > > > On Intel platforms, this patch adds LMCE to KVM MCE supported
> > > > > capabilities and handles guest access to LMCE related MSRs.
> > > > > 
> > > > > Signed-off-by: Ashok Raj 
> > > > > [Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable 
> > > > > kvm_mce_cap_supported
> > > > >Only enable LMCE on Intel platform
> > > > >  Check MSR_IA32_FEATURE_CONTROL when handling guest
> > > > >access to MSR_IA32_MCG_EXT_CTL]
> > > > > Signed-off-by: Haozhong Zhang 
> > > [...]
> > > > > @@ -6433,6 +6455,8 @@ static __init int hardware_setup(void)
> > > > >  
> > > > >   kvm_set_posted_intr_wakeup_handler(wakeup_handler);
> > > > >  
> > > > > + kvm_mce_cap_supported |= MCG_LMCE_P;
> > > > 
> > > > Ah, so virtual LMCE is available on all processors!  This is
> > > > interesting, but it also makes it more complicated to handle in QEMU; a
> > > > new QEMU generally doesn't require a new kernel.
> > > > 
> > > > Eduardo, any ideas?
> > > 
> > > (CCing libvirt list)
> > > 
> > > As we shouldn't make machine-type changes introduce new host
> > > requirements, it looks like we need to either add a new set of
> > > CPU models (unreasonable), or expect management software to
> > > explicitly enable LMCE after ensuring the host supports it.
> > >
> > > Or we could wait for a reasonable time after the feature is
> > > available in the kernel, and declare that QEMU as a whole
> > > requires a newer kernel. But how much time would be reasonable
> > > for that?
> > > 
> > > Long term, I believe we should think of a better solution. I
> > > don't think it is reasonable to require new libvirt code to be
> > > written for every single low-level feature that requires a newer
> > > kernel or newer host hardware. Maybe new introspection interfaces
> > > that would allow us to drop the "no new requirements on
> > > machine-type changes" rule?
> > >
> > 
> > Because LMCE introduces a new MSR (MSR_IA32_MCG_EXT_CTL) and a new MSR
> > bit (FEATURE_CONTROL_LMCE in MSR_IA32_FEATURE_CONTROL), QEMU requires a
> > new KVM which can handle those changes.
> 
> If I understood correctly, you are describing the second option
> above (declaring that QEMU as a whole requires a newer kernel).
> 
> > 
> > I'm not familiar with libvirt. Does the requirement of a new KVM
> > capability bring any trouble to libvirt?
>
> It does, assuming that we still support running QEMU under an
> older kernel where KVM doesn't support LMCE. In this case, the pc-2.6
> machine-type will run, but the pc-2.7 machine-type won't.
>
> The requirement of new KVM capabilities based on the machine-type
> is a problem for livirt. libvirt have some host-capabilities APIs
> to allow software to check if the VM can be run on (or migrated
> to) a host, but none of them are based on machine-type.
> 
> This is not necessarily specific to libvirt: people may have
> their own configuration or scripts that use the default "pc"
> alias, and a QEMU upgrade shouldn't break their configuration.
>

Thanks for the explanation!

If we disable LMCE in QEMU by default (even for -cpu host), will it
still be a problem? That is,

- pc-2.7 can continue to run on old kernels unless users explicitly
  require LMCE

- existing libvirt VM configurations can continue to work on pc-2.7
  because LMCE is not specified in those configurations and is
  disabled by default (i.e. no requirement for new kernels)

- existing QEMU configurations/scripts using pc alias can continue to
  work on pc-2.7 for the same reason as above.

Thanks,
Haozhong


Re: [PATCH v2 3/3] KVM: VMX: enable guest access to LMCE related MSRs

2016-06-16 Thread Haozhong Zhang
On 06/16/16 11:55, Eduardo Habkost wrote:
> On Thu, Jun 16, 2016 at 12:04:50PM +0200, Paolo Bonzini wrote:
> > On 16/06/2016 08:05, Haozhong Zhang wrote:
> > > From: Ashok Raj 
> > > 
> > > On Intel platforms, this patch adds LMCE to KVM MCE supported
> > > capabilities and handles guest access to LMCE related MSRs.
> > > 
> > > Signed-off-by: Ashok Raj 
> > > [Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported
> > >Only enable LMCE on Intel platform
> > >  Check MSR_IA32_FEATURE_CONTROL when handling guest
> > >access to MSR_IA32_MCG_EXT_CTL]
> > > Signed-off-by: Haozhong Zhang 
> [...]
> > > @@ -6433,6 +6455,8 @@ static __init int hardware_setup(void)
> > >  
> > >   kvm_set_posted_intr_wakeup_handler(wakeup_handler);
> > >  
> > > + kvm_mce_cap_supported |= MCG_LMCE_P;
> > 
> > Ah, so virtual LMCE is available on all processors!  This is
> > interesting, but it also makes it more complicated to handle in QEMU; a
> > new QEMU generally doesn't require a new kernel.
> > 
> > Eduardo, any ideas?
> 
> (CCing libvirt list)
> 
> As we shouldn't make machine-type changes introduce new host
> requirements, it looks like we need to either add a new set of
> CPU models (unreasonable), or expect management software to
> explicitly enable LMCE after ensuring the host supports it.
>
> Or we could wait for a reasonable time after the feature is
> available in the kernel, and declare that QEMU as a whole
> requires a newer kernel. But how much time would be reasonable
> for that?
> 
> Long term, I believe we should think of a better solution. I
> don't think it is reasonable to require new libvirt code to be
> written for every single low-level feature that requires a newer
> kernel or newer host hardware. Maybe new introspection interfaces
> that would allow us to drop the "no new requirements on
> machine-type changes" rule?
>

Because new MSR (MSR_IA32_MCG_EXT_CTL) and new MSR bit
(FEATURE_CONTROL_LMCE in MSR_IA32_FEATURE_CONTROL) are introduced by
LMCE, QEMU requires new KVM which can handle those changes.

I'm not familiar with libvirt. Does the requirement of new KVM
capability bring any troubles to libvirt?

Thanks,
Haozhong


Re: [PATCH v2 1/3] KVM: VMX: move msr_ia32_feature_control to vcpu_vmx

2016-06-16 Thread Haozhong Zhang
On 06/16/16 13:49, Borislav Petkov wrote:
> On Thu, Jun 16, 2016 at 02:05:29PM +0800, Haozhong Zhang wrote:
> > msr_ia32_feature_control will be used for LMCE and not depend only on
> > nested anymore, so move it from struct nested_vmx to struct vcpu_vmx.
> > 
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  arch/x86/kvm/vmx.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 57ec6a4..6b63f2d 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -421,7 +421,6 @@ struct nested_vmx {
> > struct pi_desc *pi_desc;
> > bool pi_pending;
> > u16 posted_intr_nv;
> > -   u64 msr_ia32_feature_control;
> >  
> > struct hrtimer preemption_timer;
> > bool preemption_timer_expired;
> > @@ -602,6 +601,8 @@ struct vcpu_vmx {
> > bool guest_pkru_valid;
> > u32 guest_pkru;
> > u32 host_pkru;
> > +
> > +   u64 msr_ia32_feature_control;
> >  };
> >  
> >  enum segment_cache_field {
> > @@ -2907,7 +2908,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > case MSR_IA32_FEATURE_CONTROL:
> > if (!nested_vmx_allowed(vcpu))
> > return 1;
> > -   msr_info->data = to_vmx(vcpu)->nested.msr_ia32_feature_control;
> > +   msr_info->data = to_vmx(vcpu)->msr_ia32_feature_control;
> 
> Since this moves out of struct nested_vmx, that check above it:
> 
>   if (!nested_vmx_allowed(vcpu))
> 
> should not influence it anymore, no?
> 
> Ditto for the rest.
> 

The same check in the set case still makes sense. I can remove the
check here and leave it in the set case.

Thanks,
Haozhong


Re: [PATCH v2 2/3] KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL

2016-06-16 Thread Haozhong Zhang
On 06/16/16 13:19, Paolo Bonzini wrote:
> 
> 
> On 16/06/2016 13:16, Haozhong Zhang wrote:
> >> However, I think FEATURE_CONTROL_LOCKED should always be writable.  If
> >> you change that, it's simpler to just do |= and &= in the caller.
> > 
> > These two functions (add/del) are to prevent callers from forgetting
> > setting/clearing FEATURE_CONTROL_LOCKED in
> > msr_ia32_feature_control_valid_bits: it should be set if any feature
> > bit is set, and be cleared if all feature bits are cleared. The second
> > rule could be relaxed, as we can always present MSR_IA32_FEATURE_CONTROL
> > to the guest.
> 
> Yes, this means that FEATURE_CONTROL_LOCKED effectively is always valid.
>  So you end up with just &= to clear and |= to set.
> 
> > I'm okay to let callers take care of the locked bit.
> > 
> >>> + to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
> >>> + bits | FEATURE_CONTROL_LOCKED;
> >>> +}
> >>> +
> >>> +static void feature_control_valid_bits_del(struct kvm_vcpu *vcpu, uint64_t bits)
> >>> +{
> >>> + uint64_t *valid_bits =
> >>> + &to_vmx(vcpu)->msr_ia32_feature_control_valid_bits;
> >>> + ASSERT(!(bits & ~FEATURE_CONTROL_MAX_VALID_BITS));
> >>> + *valid_bits &= ~bits;
> >>> + if (!(*valid_bits & ~FEATURE_CONTROL_LOCKED))
> >>> + *valid_bits = 0;
> >>> +}
> >>> +
> >>>  #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
> >>>  #define FIELD(number, name)  [number] = VMCS12_OFFSET(name)
> >>>  #define FIELD64(number, name)[number] = VMCS12_OFFSET(name), \
> >>> @@ -2864,6 +2897,14 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
> >>>   return 0;
> >>>  }
> >>>  
> >>> +static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
> >>> +  uint64_t val)
> >>> +{
> >>> + uint64_t valid_bits = to_vmx(vcpu)->msr_ia32_feature_control_valid_bits;
> >>> +
> >>> + return valid_bits && !(val & ~valid_bits);
> >>> +}
> >>>  /*
> >>>   * Reads an msr value (of 'msr_index') into 'pdata'.
> >>>   * Returns 0 on success, non-0 otherwise.
> >>> @@ -2906,7 +2947,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >>>   msr_info->data = vmcs_read64(GUEST_BNDCFGS);
> >>>   break;
> >>>   case MSR_IA32_FEATURE_CONTROL:
> >>> - if (!nested_vmx_allowed(vcpu))
> >>> + if (!vmx_feature_control_msr_valid(vcpu, 0))
> >>
> >> You can remove this if completely in patch 1.  It's okay to make the MSR
> >> always available.
> >>
> > 
> > But then it also allows all bits to be set by guests, even though some
> > features are not available.
> 
> Note that this is "get".  Of course the "if" must stay in vmx_set_msr.
>

My mistake. I'll remove it in patch 1.

Thanks,
Haozhong


Re: [PATCH v2 2/3] KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL

2016-06-16 Thread Haozhong Zhang
On 06/16/16 11:55, Paolo Bonzini wrote:
> 
> 
> On 16/06/2016 08:05, Haozhong Zhang wrote:
> > +   /*
> > +* Only bits masked by msr_ia32_feature_control_valid_bits can be set in
> > +* msr_ia32_feature_control.
> > +*
> > +* msr_ia32_feature_control_valid_bits should be modified by
> > +* feature_control_valid_bits_add/del(), and only bits masked by
> > +* FEATURE_CONTROL_MAX_VALID_BITS can be modified.
> > +*/
> > u64 msr_ia32_feature_control;
> > +   u64 msr_ia32_feature_control_valid_bits;
> 
> I noticed that the fw_cfg patch used an uint32_t.  It probably should
> use uint64_t; what you did here is correct.
>

I'll fix it there.

Thanks,
Haozhong


Re: [PATCH v2 2/3] KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL

2016-06-16 Thread Haozhong Zhang
On 06/16/16 12:01, Paolo Bonzini wrote:
> 
> 
> On 16/06/2016 08:05, Haozhong Zhang wrote:
> > +/*
> > + * FEATURE_CONTROL_LOCKED is added/removed automatically by
> > + * feature_control_valid_bits_add/del(), so it's not included here.
> > + */
> > +#define FEATURE_CONTROL_MAX_VALID_BITS \
> > +   FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX
> > +
> > +static void feature_control_valid_bits_add(struct kvm_vcpu *vcpu, uint64_t bits)
> > +{
> > +   ASSERT(!(bits & ~FEATURE_CONTROL_MAX_VALID_BITS));
> 
> The KVM-specific ASSERT macro should go.  You can use WARN_ON.
>

will change to WARN_ON.

> However, I think FEATURE_CONTROL_LOCKED should always be writable.  If
> you change that, it's simpler to just do |= and &= in the caller.
>

These two functions (add/del) are to prevent callers from forgetting
setting/clearing FEATURE_CONTROL_LOCKED in
msr_ia32_feature_control_valid_bits: it should be set if any feature
bit is set, and be cleared if all feature bits are cleared. The second
rule could be relaxed, as we can always present MSR_IA32_FEATURE_CONTROL
to the guest.

I'm okay to let callers take care of the locked bit.

> > +   to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
> > +   bits | FEATURE_CONTROL_LOCKED;
> > +}
> > +
> > +static void feature_control_valid_bits_del(struct kvm_vcpu *vcpu, uint64_t bits)
> > +{
> > +   uint64_t *valid_bits =
> > +   &to_vmx(vcpu)->msr_ia32_feature_control_valid_bits;
> > +   ASSERT(!(bits & ~FEATURE_CONTROL_MAX_VALID_BITS));
> > +   *valid_bits &= ~bits;
> > +   if (!(*valid_bits & ~FEATURE_CONTROL_LOCKED))
> > +   *valid_bits = 0;
> > +}
> > +
> >  #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
> >  #define FIELD(number, name)[number] = VMCS12_OFFSET(name)
> >  #define FIELD64(number, name)  [number] = VMCS12_OFFSET(name), \
> > @@ -2864,6 +2897,14 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
> > return 0;
> >  }
> >  
> > +static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
> > +uint64_t val)
> > +{
> > +   uint64_t valid_bits = to_vmx(vcpu)->msr_ia32_feature_control_valid_bits;
> > +
> > +   return valid_bits && !(val & ~valid_bits);
> > +}
> >  /*
> >   * Reads an msr value (of 'msr_index') into 'pdata'.
> >   * Returns 0 on success, non-0 otherwise.
> > @@ -2906,7 +2947,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > msr_info->data = vmcs_read64(GUEST_BNDCFGS);
> > break;
> > case MSR_IA32_FEATURE_CONTROL:
> > -   if (!nested_vmx_allowed(vcpu))
> > +   if (!vmx_feature_control_msr_valid(vcpu, 0))
> 
> You can remove this if completely in patch 1.  It's okay to make the MSR
> always available.
>

But then it also allows all bits to be set by guests, even though some
features are not available. Though this problem already existed before
Patch 1, I prefer to leave it as it was in Patch 1 and fix it here.

Haozhong


Re: [PATCH v2 3/3] KVM: VMX: enable guest access to LMCE related MSRs

2016-06-16 Thread Haozhong Zhang
On 06/16/16 12:04, Paolo Bonzini wrote:
> 
> 
> On 16/06/2016 08:05, Haozhong Zhang wrote:
> > From: Ashok Raj 
> > 
> > On Intel platforms, this patch adds LMCE to KVM MCE supported
> > capabilities and handles guest access to LMCE related MSRs.
> > 
> > Signed-off-by: Ashok Raj 
> > [Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported
> >Only enable LMCE on Intel platform
> >Check MSR_IA32_FEATURE_CONTROL when handling guest
> >      access to MSR_IA32_MCG_EXT_CTL]
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  arch/x86/include/asm/kvm_host.h |  5 +
> >  arch/x86/kvm/vmx.c  | 36 +++-
> >  arch/x86/kvm/x86.c  | 15 +--
> >  3 files changed, 49 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index e0fbe7e..75defa6 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -598,6 +598,7 @@ struct kvm_vcpu_arch {
> > u64 mcg_cap;
> > u64 mcg_status;
> > u64 mcg_ctl;
> > +   u64 mcg_ext_ctl;
> > u64 *mce_banks;
> >  
> > /* Cache MMIO info */
> > @@ -1005,6 +1006,8 @@ struct kvm_x86_ops {
> > int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
> >   uint32_t guest_irq, bool set);
> > void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
> > +
> > +   void (*setup_mce)(struct kvm_vcpu *vcpu);
> >  };
> >  
> >  struct kvm_arch_async_pf {
> > @@ -1077,6 +1080,8 @@ extern u8   kvm_tsc_scaling_ratio_frac_bits;
> >  /* maximum allowed value of TSC scaling ratio */
> >  extern u64  kvm_max_tsc_scaling_ratio;
> >  
> > +extern u64 kvm_mce_cap_supported;
> > +
> >  enum emulation_result {
> > EMULATE_DONE, /* no further processing */
> > EMULATE_USER_EXIT,/* kvm_run ready for userspace exit */
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 1dc89c5..42db42e 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -638,7 +638,7 @@ static struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> >   * feature_control_valid_bits_add/del(), so it's not included here.
> >   */
> >  #define FEATURE_CONTROL_MAX_VALID_BITS \
> > -   FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX
> > +   (FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX | FEATURE_CONTROL_LMCE)
> >  
> >  static void feature_control_valid_bits_add(struct kvm_vcpu *vcpu, uint64_t bits)
> >  {
> > @@ -2905,6 +2905,15 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
> > return valid_bits && !(val & ~valid_bits);
> >  }
> >  
> > +static inline bool vmx_mcg_ext_ctrl_msr_present(struct kvm_vcpu *vcpu,
> > +   bool host_initiated)
> > +{
> > +   return (vcpu->arch.mcg_cap & MCG_LMCE_P) &&
> 
> Checking MCG_LMCE_P is unnecessary, because you cannot set
> FEATURE_CONTROL_LMCE unless MCG_LMCE_P is present.
> 
> You can just inline this function in the callers, it's simpler.

I'll remove the first line check and inline the other parts.

> 
> > +   (host_initiated ||
> > +(to_vmx(vcpu)->msr_ia32_feature_control &
> > + FEATURE_CONTROL_LMCE));
> > +}
> > +
> >  /*
> >   * Reads an msr value (of 'msr_index') into 'pdata'.
> >   * Returns 0 on success, non-0 otherwise.
> > @@ -2946,6 +2955,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > return 1;
> > msr_info->data = vmcs_read64(GUEST_BNDCFGS);
> > break;
> > +   case MSR_IA32_MCG_EXT_CTL:
> > +   if (!vmx_mcg_ext_ctrl_msr_present(vcpu,
> > + msr_info->host_initiated))
> > +   return 1;
> > +   msr_info->data = vcpu->arch.mcg_ext_ctl;
> > +   break;
> > case MSR_IA32_FEATURE_CONTROL:
> > if (!vmx_feature_control_msr_valid(vcpu, 0))
> > return 1;
> > @@ -3039,6 +3054,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > case MSR_IA32_TSC_ADJUST:
> > ret = kvm_set_msr_common(vcpu, msr_info);
> > break;
> > +  

[PATCH v2 3/3] KVM: VMX: enable guest access to LMCE related MSRs

2016-06-15 Thread Haozhong Zhang
From: Ashok Raj 

On Intel platforms, this patch adds LMCE to KVM MCE supported
capabilities and handles guest access to LMCE related MSRs.

Signed-off-by: Ashok Raj 
[Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported
   Only enable LMCE on Intel platform
   Check MSR_IA32_FEATURE_CONTROL when handling guest
 access to MSR_IA32_MCG_EXT_CTL]
Signed-off-by: Haozhong Zhang 
---
 arch/x86/include/asm/kvm_host.h |  5 +
 arch/x86/kvm/vmx.c  | 36 +++-
 arch/x86/kvm/x86.c  | 15 +--
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e0fbe7e..75defa6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -598,6 +598,7 @@ struct kvm_vcpu_arch {
u64 mcg_cap;
u64 mcg_status;
u64 mcg_ctl;
+   u64 mcg_ext_ctl;
u64 *mce_banks;
 
/* Cache MMIO info */
@@ -1005,6 +1006,8 @@ struct kvm_x86_ops {
int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
  uint32_t guest_irq, bool set);
void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
+
+   void (*setup_mce)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
@@ -1077,6 +1080,8 @@ extern u8   kvm_tsc_scaling_ratio_frac_bits;
 /* maximum allowed value of TSC scaling ratio */
 extern u64  kvm_max_tsc_scaling_ratio;
 
+extern u64 kvm_mce_cap_supported;
+
 enum emulation_result {
EMULATE_DONE, /* no further processing */
EMULATE_USER_EXIT,/* kvm_run ready for userspace exit */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1dc89c5..42db42e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -638,7 +638,7 @@ static struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
  * feature_control_valid_bits_add/del(), so it's not included here.
  */
 #define FEATURE_CONTROL_MAX_VALID_BITS \
-   FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX
+   (FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX | FEATURE_CONTROL_LMCE)
 
 static void feature_control_valid_bits_add(struct kvm_vcpu *vcpu, uint64_t bits)
 {
@@ -2905,6 +2905,15 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
return valid_bits && !(val & ~valid_bits);
 }
 
+static inline bool vmx_mcg_ext_ctrl_msr_present(struct kvm_vcpu *vcpu,
+   bool host_initiated)
+{
+   return (vcpu->arch.mcg_cap & MCG_LMCE_P) &&
+   (host_initiated ||
+(to_vmx(vcpu)->msr_ia32_feature_control &
+ FEATURE_CONTROL_LMCE));
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
@@ -2946,6 +2955,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
msr_info->data = vmcs_read64(GUEST_BNDCFGS);
break;
+   case MSR_IA32_MCG_EXT_CTL:
+   if (!vmx_mcg_ext_ctrl_msr_present(vcpu,
+ msr_info->host_initiated))
+   return 1;
+   msr_info->data = vcpu->arch.mcg_ext_ctl;
+   break;
case MSR_IA32_FEATURE_CONTROL:
if (!vmx_feature_control_msr_valid(vcpu, 0))
return 1;
@@ -3039,6 +3054,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC_ADJUST:
ret = kvm_set_msr_common(vcpu, msr_info);
break;
+   case MSR_IA32_MCG_EXT_CTL:
+   if (!vmx_mcg_ext_ctrl_msr_present(vcpu,
+ msr_info->host_initiated) ||
+   (data & ~MCG_EXT_CTL_LMCE_EN))
+   return 1;
+   vcpu->arch.mcg_ext_ctl = data;
+   break;
case MSR_IA32_FEATURE_CONTROL:
if (!vmx_feature_control_msr_valid(vcpu, data) ||
(to_vmx(vcpu)->msr_ia32_feature_control &
@@ -6433,6 +6455,8 @@ static __init int hardware_setup(void)
 
kvm_set_posted_intr_wakeup_handler(wakeup_handler);
 
+   kvm_mce_cap_supported |= MCG_LMCE_P;
+
return alloc_kvm_area();
 
 out8:
@@ -10950,6 +10974,14 @@ out:
return ret;
 }
 
+static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->arch.mcg_cap & MCG_LMCE_P)
+   feature_control_valid_bits_add(vcpu, FEATURE_CONTROL_LMCE);
+   else
+   feature_control_valid_bits_del(vcpu, FEATURE_CONTROL_LMCE);
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -11074,6 +11106,8 @@ stat

[PATCH v2 1/3] KVM: VMX: move msr_ia32_feature_control to vcpu_vmx

2016-06-15 Thread Haozhong Zhang
msr_ia32_feature_control will be used for LMCE and not depend only on
nested anymore, so move it from struct nested_vmx to struct vcpu_vmx.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/kvm/vmx.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 57ec6a4..6b63f2d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -421,7 +421,6 @@ struct nested_vmx {
struct pi_desc *pi_desc;
bool pi_pending;
u16 posted_intr_nv;
-   u64 msr_ia32_feature_control;
 
struct hrtimer preemption_timer;
bool preemption_timer_expired;
@@ -602,6 +601,8 @@ struct vcpu_vmx {
bool guest_pkru_valid;
u32 guest_pkru;
u32 host_pkru;
+
+   u64 msr_ia32_feature_control;
 };
 
 enum segment_cache_field {
@@ -2907,7 +2908,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_FEATURE_CONTROL:
if (!nested_vmx_allowed(vcpu))
return 1;
-   msr_info->data = to_vmx(vcpu)->nested.msr_ia32_feature_control;
+   msr_info->data = to_vmx(vcpu)->msr_ia32_feature_control;
break;
case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
if (!nested_vmx_allowed(vcpu))
@@ -2999,10 +3000,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
case MSR_IA32_FEATURE_CONTROL:
if (!nested_vmx_allowed(vcpu) ||
-   (to_vmx(vcpu)->nested.msr_ia32_feature_control &
+   (to_vmx(vcpu)->msr_ia32_feature_control &
 FEATURE_CONTROL_LOCKED && !msr_info->host_initiated))
return 1;
-   vmx->nested.msr_ia32_feature_control = data;
+   vmx->msr_ia32_feature_control = data;
if (msr_info->host_initiated && data == 0)
vmx_leave_nested(vcpu);
break;
@@ -6859,7 +6860,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
return 1;
}
 
-   if ((vmx->nested.msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
+   if ((vmx->msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
!= VMXON_NEEDED_FEATURES) {
kvm_inject_gp(vcpu, 0);
return 1;
-- 
2.9.0



[PATCH v2 2/3] KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL

2016-06-15 Thread Haozhong Zhang
KVM currently does not check the value written to the guest
MSR_IA32_FEATURE_CONTROL, so bits corresponding to disabled features
may be set. This patch makes KVM validate the individual bits written to
the guest MSR_IA32_FEATURE_CONTROL according to the enabled features.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/kvm/vmx.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6b63f2d..1dc89c5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -602,7 +602,16 @@ struct vcpu_vmx {
u32 guest_pkru;
u32 host_pkru;
 
+   /*
+* Only bits masked by msr_ia32_feature_control_valid_bits can be set in
+* msr_ia32_feature_control.
+*
+* msr_ia32_feature_control_valid_bits should be modified by
+* feature_control_valid_bits_add/del(), and only bits masked by
+* FEATURE_CONTROL_MAX_VALID_BITS can be modified.
+*/
u64 msr_ia32_feature_control;
+   u64 msr_ia32_feature_control_valid_bits;
 };
 
 enum segment_cache_field {
@@ -624,6 +633,30 @@ static struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
return &(to_vmx(vcpu)->pi_desc);
 }
 
+/*
+ * FEATURE_CONTROL_LOCKED is added/removed automatically by
+ * feature_control_valid_bits_add/del(), so it's not included here.
+ */
+#define FEATURE_CONTROL_MAX_VALID_BITS \
+   FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX
+
+static void feature_control_valid_bits_add(struct kvm_vcpu *vcpu, uint64_t bits)
+{
+   ASSERT(!(bits & ~FEATURE_CONTROL_MAX_VALID_BITS));
+   to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
+   bits | FEATURE_CONTROL_LOCKED;
+}
+
+static void feature_control_valid_bits_del(struct kvm_vcpu *vcpu, uint64_t bits)
+{
+   uint64_t *valid_bits =
+   &to_vmx(vcpu)->msr_ia32_feature_control_valid_bits;
+   ASSERT(!(bits & ~FEATURE_CONTROL_MAX_VALID_BITS));
+   *valid_bits &= ~bits;
+   if (!(*valid_bits & ~FEATURE_CONTROL_LOCKED))
+   *valid_bits = 0;
+}
+
 #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
 #define FIELD(number, name)[number] = VMCS12_OFFSET(name)
 #define FIELD64(number, name)  [number] = VMCS12_OFFSET(name), \
@@ -2864,6 +2897,14 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
return 0;
 }
 
+static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
+uint64_t val)
+{
+   uint64_t valid_bits = to_vmx(vcpu)->msr_ia32_feature_control_valid_bits;
+
+   return valid_bits && !(val & ~valid_bits);
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
@@ -2906,7 +2947,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vmcs_read64(GUEST_BNDCFGS);
break;
case MSR_IA32_FEATURE_CONTROL:
-   if (!nested_vmx_allowed(vcpu))
+   if (!vmx_feature_control_msr_valid(vcpu, 0))
return 1;
msr_info->data = to_vmx(vcpu)->msr_ia32_feature_control;
break;
@@ -2999,7 +3040,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
ret = kvm_set_msr_common(vcpu, msr_info);
break;
case MSR_IA32_FEATURE_CONTROL:
-   if (!nested_vmx_allowed(vcpu) ||
+   if (!vmx_feature_control_msr_valid(vcpu, data) ||
(to_vmx(vcpu)->msr_ia32_feature_control &
 FEATURE_CONTROL_LOCKED && !msr_info->host_initiated))
return 1;
@@ -9095,6 +9136,13 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
vmx->nested.nested_vmx_secondary_ctls_high &=
~SECONDARY_EXEC_PCOMMIT;
}
+
+   if (nested_vmx_allowed(vcpu))
+   feature_control_valid_bits_add
+   (vcpu, FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX);
+   else
+   feature_control_valid_bits_del
+   (vcpu, FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX);
 }
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
-- 
2.9.0



[PATCH v2 0/3] Add KVM support for Intel local MCE

2016-06-15 Thread Haozhong Zhang
Changes in v2:
 * v1 Patch 1 becomes v2 Patch 3.
 * Fix SOB chain in Patch 3. (Boris Petkov)
 * (Patch 1) Move msr_ia32_feature_control from nested_vmx to
   vcpu_vmx, because it does not depend only on nested after this
   patch series. (Radim Krčmář)
 * (Patch 2) Add a valid bitmask for MSR_IA32_FEATURE_CONTROL to allow
   checking individual bits of MSR_IA32_FEATURE_CONTROL according to
   enabled features. (Radim Krčmář)
 * Move the common check in handling MSR_IA32_MCG_EXT_CTL to function
   vmx_mcg_ext_ctl_msr_present. (Radim Krčmář)

Changes in v1:
 * Change macro KVM_MCE_CAP_SUPPORTED to variable kvm_mce_cap_supported.
 * Include LMCE capability in kvm_mce_cap_supported only on Intel CPU,
   i.e. LMCE can be enabled only on Intel CPU.
 * Check if LMCE is enabled in guest MSR_IA32_FEATURE_CONTROL when
   handling guest access to MSR_IA32_MCG_EXT_CTL.

This patch series, along with the corresponding QEMU patch series (sent
via another email with the title "[PATCH v4 0/3] Add QEMU support for
Intel local MCE"), enables the Intel local MCE feature for guests. The
KVM series handles guest access to the LMCE-related MSRs
(MSR_IA32_MCG_EXT_CTL and MSR_IA32_FEATURE_CONTROL).

Ashok Raj (1):
  KVM: VMX: enable guest access to LMCE related MSRs

Haozhong Zhang (2):
  KVM: VMX: move msr_ia32_feature_control to vcpu_vmx
  KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL

 arch/x86/include/asm/kvm_host.h |  5 +++
 arch/x86/kvm/vmx.c  | 97 ++---
 arch/x86/kvm/x86.c  | 15 ---
 3 files changed, 104 insertions(+), 13 deletions(-)

-- 
2.9.0



Re: [PATCH v1] KVM: VMX: enable guest access to LMCE related MSRs

2016-06-05 Thread Haozhong Zhang
On 06/05/16 21:43, Borislav Petkov wrote:
> On Sun, Jun 05, 2016 at 11:14:56PM +0800, Haozhong Zhang wrote:
> > Ashok was also involved in the development of the v1 patch, and it's based
> > on his v0 patch, so I think I should take his SOB?
> 
> You have at least three options:
> 
> 1.
> From: Author Name 
> 
> ...
> 
> Signed-off-by: Author Name 
> [ Submitter did this and that to the Author's patch ]
> Signed-off-by: Submitter Name 
>

Thanks! I'll take this option.

Haozhong

> So you either do that or you say something like
>
> 2. "Based on an original patch by Ashok..."
> 
> in the commit message
> 
> or
> 
> you add the tag
> 
> 3. Originally-by: Ashok..
> 
> Makes sense?
> 
> For more info see Documentation/SubmittingPatches.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 
> 21284 (AG Nürnberg)
> -- 


Re: [PATCH v1] KVM: VMX: enable guest access to LMCE related MSRs

2016-06-05 Thread Haozhong Zhang
On 06/04/16 13:01, Boris Petkov wrote:
> Haozhong Zhang  wrote:
> 
> >On Intel platforms, this patch adds LMCE to KVM MCE supported
> >capabilities and handles guest access to LMCE related MSRs.
> >
> >Signed-off-by: Ashok Raj 
> >Signed-off-by: Haozhong Zhang 
> 
> SOB chain needs correction wrt who the author is and who the submitter. 
> 

Ashok was also involved in the development of the v1 patch, and it's based
on his v0 patch, so I think I should take his SOB?

Thanks,
Haozhong


Re: [PATCH v1] KVM: VMX: enable guest access to LMCE related MSRs

2016-06-05 Thread Haozhong Zhang
On 06/03/16 17:34, Radim Krčmář wrote:
> 2016-06-03 14:08+0800, Haozhong Zhang:
> > On Intel platforms, this patch adds LMCE to KVM MCE supported
> > capabilities and handles guest access to LMCE related MSRs.
> > 
> > Signed-off-by: Ashok Raj 
> > Signed-off-by: Haozhong Zhang 
> > ---
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > @@ -2863,6 +2863,11 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
> > return 0;
> >  }
> >  
> > +static inline bool vmx_feature_control_msr_required(struct kvm_vcpu *vcpu)
> 
> I'd call it "present", rather than "required".  SDM does so for other
> MSRs and it is easier to understand in the condition that returns #GP if
> this function is false.
>

Agreed, I'll change it.

> > +{
> > +   return nested_vmx_allowed(vcpu) || (vcpu->arch.mcg_cap & MCG_LMCE_P);
> > +}
> > @@ -2904,8 +2909,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > case MSR_IA32_FEATURE_CONTROL:
> > -   if (!nested_vmx_allowed(vcpu))
> > +   if (!vmx_feature_control_msr_required(vcpu))
> > return 1;
> > msr_info->data = to_vmx(vcpu)->nested.msr_ia32_feature_control;
> 
> (MSR_IA32_FEATURE_CONTROL does not depend only on nested anymore, so
>  moving msr_ia32_feature_control from struct nested_vmx to struct
>  vcpu_vmx would make sense.)
>

will move in the next version

> > break;
> > @@ -2997,8 +3009,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > +   case MSR_IA32_MCG_EXT_CTL:
> > +   if (!(vcpu->arch.mcg_cap & MCG_LMCE_P) ||
> > +   !(to_vmx(vcpu)->nested.msr_ia32_feature_control &
> > + FEATURE_CONTROL_LMCE))
> 
> (This check is used twice and could be given a name too,
>  "vmx_mcg_ext_ctl_msr_present()"?)
>

will change

> > +   return 1;
> > +   if (data && data != 0x1)
> 
>   (data & ~MCG_EXT_CTL_LMCE_EN)
> 
> is a clearer check for reserved bits.
>

ditto

> > +   return -1;
> > +   vcpu->arch.mcg_ext_ctl = data;
> > +   break;
> > case MSR_IA32_FEATURE_CONTROL:
> > -   if (!nested_vmx_allowed(vcpu) ||
> > +   if (!vmx_feature_control_msr_required(vcpu) ||
> > (to_vmx(vcpu)->nested.msr_ia32_feature_control &
> >  FEATURE_CONTROL_LOCKED && !msr_info->host_initiated))
> > return 1;
> 
> Does hardware throw #GP when FEATURE_CONTROL_LMCE is set without
> MCG_LMCE_P?
>

Yes, SDM Vol 2 says that setting reserved bits of an MSR causes a #GP.

> (We could emulate that by having a mask of valid bits and also use that
>  mask in place of vmx_feature_control_msr_required().  I don't think
>  there is a reason to have only vmx_feature_control_msr_required() if
>  the hardware can #GP on individual bits too.)
>

Oh yes, I should also validate the individual bits. I'll add it in the
next version.

Thanks,
Haozhong
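
The rules this series converged on for MSR_IA32_MCG_EXT_CTL can be
stated compactly: the MSR is present to the guest only when LMCE is
enabled in its IA32_FEATURE_CONTROL (host-initiated, i.e. ioctl,
accesses are exempt), and a write that sets any bit other than LMCE_EN
gets a #GP. A hedged sketch of the write path under those rules, in
plain C with hypothetical names rather than the kernel code:

#include <stdint.h>
#include <stdbool.h>

#define MCG_EXT_CTL_LMCE_EN	(1ULL << 0)

/* Returns non-zero to signal a #GP, mirroring the KVM convention. */
static int set_mcg_ext_ctl(bool lmce_enabled, bool host_initiated,
			   uint64_t data, uint64_t *mcg_ext_ctl)
{
	if (!host_initiated && !lmce_enabled)
		return 1;	/* MSR not present to the guest */
	if (data & ~MCG_EXT_CTL_LMCE_EN)
		return 1;	/* a reserved bit is set */
	*mcg_ext_ctl = data;
	return 0;
}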


[PATCH v1] KVM: VMX: enable guest access to LMCE related MSRs

2016-06-02 Thread Haozhong Zhang
On Intel platforms, this patch adds LMCE to KVM MCE supported
capabilities and handles guest access to LMCE related MSRs.

Signed-off-by: Ashok Raj 
Signed-off-by: Haozhong Zhang 
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/vmx.c  | 27 +--
 arch/x86/kvm/x86.c  | 12 ++--
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e0fbe7e..89509f9d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -598,6 +598,7 @@ struct kvm_vcpu_arch {
u64 mcg_cap;
u64 mcg_status;
u64 mcg_ctl;
+   u64 mcg_ext_ctl;
u64 *mce_banks;
 
/* Cache MMIO info */
@@ -1077,6 +1078,8 @@ extern u8   kvm_tsc_scaling_ratio_frac_bits;
 /* maximum allowed value of TSC scaling ratio */
 extern u64  kvm_max_tsc_scaling_ratio;
 
+extern u64 kvm_mce_cap_supported;
+
 enum emulation_result {
EMULATE_DONE, /* no further processing */
EMULATE_USER_EXIT,/* kvm_run ready for userspace exit */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fb93010..42c3ee1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2863,6 +2863,11 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
return 0;
 }
 
+static inline bool vmx_feature_control_msr_required(struct kvm_vcpu *vcpu)
+{
+   return nested_vmx_allowed(vcpu) || (vcpu->arch.mcg_cap & MCG_LMCE_P);
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
@@ -2904,8 +2909,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
msr_info->data = vmcs_read64(GUEST_BNDCFGS);
break;
+   case MSR_IA32_MCG_EXT_CTL:
+   if (!(vcpu->arch.mcg_cap & MCG_LMCE_P) ||
+   !(to_vmx(vcpu)->nested.msr_ia32_feature_control &
+ FEATURE_CONTROL_LMCE))
+   return 1;
+   msr_info->data = vcpu->arch.mcg_ext_ctl;
+   break;
case MSR_IA32_FEATURE_CONTROL:
-   if (!nested_vmx_allowed(vcpu))
+   if (!vmx_feature_control_msr_required(vcpu))
return 1;
msr_info->data = to_vmx(vcpu)->nested.msr_ia32_feature_control;
break;
@@ -2997,8 +3009,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC_ADJUST:
ret = kvm_set_msr_common(vcpu, msr_info);
break;
+   case MSR_IA32_MCG_EXT_CTL:
+   if (!(vcpu->arch.mcg_cap & MCG_LMCE_P) ||
+   !(to_vmx(vcpu)->nested.msr_ia32_feature_control &
+ FEATURE_CONTROL_LMCE))
+   return 1;
+   if (data && data != 0x1)
+   return -1;
+   vcpu->arch.mcg_ext_ctl = data;
+   break;
case MSR_IA32_FEATURE_CONTROL:
-   if (!nested_vmx_allowed(vcpu) ||
+   if (!vmx_feature_control_msr_required(vcpu) ||
(to_vmx(vcpu)->nested.msr_ia32_feature_control &
 FEATURE_CONTROL_LOCKED && !msr_info->host_initiated))
return 1;
@@ -6391,6 +6412,8 @@ static __init int hardware_setup(void)
 
kvm_set_posted_intr_wakeup_handler(wakeup_handler);
 
+   kvm_mce_cap_supported |= MCG_LMCE_P;
+
return alloc_kvm_area();
 
 out8:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 902d9da..0cc8f00 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -70,7 +70,8 @@
 
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
-#define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
+u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;
+EXPORT_SYMBOL_GPL(kvm_mce_cap_supported);
 
 #define emul_to_vcpu(ctxt) \
container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt)
@@ -982,6 +983,7 @@ static u32 emulated_msrs[] = {
MSR_IA32_MISC_ENABLE,
MSR_IA32_MCG_STATUS,
MSR_IA32_MCG_CTL,
+   MSR_IA32_MCG_EXT_CTL,
MSR_IA32_SMBASE,
 };
 
@@ -2683,11 +2685,9 @@ long kvm_arch_dev_ioctl(struct file *filp,
break;
}
case KVM_X86_GET_MCE_CAP_SUPPORTED: {
-   u64 mce_cap;
-
-   mce_cap = KVM_MCE_CAP_SUPPORTED;
r = -EFAULT;
-   if (copy_to_user(argp, &mce_cap, sizeof mce_cap))
+   if (copy_to_user(argp, &kvm_mce_cap_supported,
+sizeof(kvm_mce_cap_supported)))
goto out;
r = 0;
break;
@@ -2865,7 +2865,7 @@ static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu,
  

[PATCH v1] Add KVM support for Intel local MCE

2016-06-02 Thread Haozhong Zhang
The previous v0 patch can be found at
http://permalink.gmane.org/gmane.linux.kernel/2104347

Changes in v1:
 * Change macro KVM_MCE_CAP_SUPPORTED to variable kvm_mce_cap_supported.
 * Include LMCE capability in kvm_mce_cap_supported only on Intel CPU,
   i.e. LMCE can be enabled only on Intel CPU.
 * Check if LMCE is enabled in guest MSR_IA32_FEATURE_CONTROL when
   handling guest access to MSR_IA32_MCG_EXT_CTL.

This patch, along with the corresponding QEMU patch series (sent via
another email with the title "[PATCH v3 0/2] Add QEMU support for Intel
local MCE"), enables the Intel local MCE feature for guests. This KVM
patch handles guest access to the LMCE-related MSRs (MSR_IA32_MCG_EXT_CTL
and MSR_IA32_FEATURE_CONTROL).

Haozhong Zhang (1):
  KVM: VMX: enable guest access to LMCE related MSRs

 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/vmx.c  | 27 +--
 arch/x86/kvm/x86.c  | 12 ++--
 3 files changed, 34 insertions(+), 8 deletions(-)

-- 
2.8.3



[PATCH] KVM: VMX: Fix host initiated access to guest MSR_TSC_AUX

2015-12-14 Thread Haozhong Zhang
The current handling of accesses to the guest MSR_TSC_AUX returns an
error if the vcpu does not support RDTSCP, even when those accesses are
initiated by the host. This can cause reboot failures with some versions
of QEMU. This patch fixes the issue by passing those host-initiated
accesses on for further handling instead.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/kvm/vmx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1a8bfaa..50f2b78 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2802,7 +2802,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vcpu->arch.ia32_xss;
break;
case MSR_TSC_AUX:
-   if (!guest_cpuid_has_rdtscp(vcpu))
+   if (!guest_cpuid_has_rdtscp(vcpu) && !msr_info->host_initiated)
return 1;
/* Otherwise falls through */
default:
@@ -2908,7 +2908,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
clear_atomic_switch_msr(vmx, MSR_IA32_XSS);
break;
case MSR_TSC_AUX:
-   if (!guest_cpuid_has_rdtscp(vcpu))
+   if (!guest_cpuid_has_rdtscp(vcpu) && !msr_info->host_initiated)
return 1;
/* Check reserved bit, higher 32 bits should be zero */
if ((data >> 32) != 0)
-- 
2.6.4
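
The fix reduces to a single predicate: a guest-initiated access to
MSR_TSC_AUX requires the RDTSCP CPUID bit, while a host-initiated
(ioctl) access does not. A standalone sketch of that predicate, with
hypothetical names rather than the kernel code:

#include <stdbool.h>

static bool tsc_aux_access_allowed(bool guest_has_rdtscp,
				   bool host_initiated)
{
	/* Host-initiated accesses bypass the guest CPUID check. */
	return guest_has_rdtscp || host_initiated;
}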



Re: [PATCH] KVM: nVMX: remove incorrect vpid check in nested invvpid emulation

2015-11-25 Thread Haozhong Zhang
On 11/25/15 10:45, Bandan Das wrote:
> Haozhong Zhang  writes:
> 
> > This patch removes the vpid check when emulating nested invvpid
> > instruction of type all-contexts invalidation. The existing code is
> > incorrect because:
> >  (1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
> >  Translations Based on VPID", invvpid instruction does not check
> >  vpid in the invvpid descriptor when its type is all-contexts
> >  invalidation.
> 
> But iirc isn't vpid=0 reserved for root mode ?
Yes,

> I think we don't want
> L1 hypervisor to be able do a invvpid(0).

but the invvpid emulated here performs the all-contexts invalidation, which
does not use the given vpid and "invalidates all mappings tagged with all
VPIDs except VPID 0000H" (from the Intel SDM).

> 
> >  (2) According to the same document, invvpid of type all-contexts
> >  invalidation does not require that there be an active VMCS, so
> >  get_vmcs12() in the existing code may result in a NULL-pointer
> >  dereference. In practice, it can crash both KVM itself and L1
> >  hypervisors that use invvpid (e.g. Xen).
> 
> If that is the case, then just check if it's null and return without
> doing anything.

(According to the Intel SDM) invvpid of type all-contexts invalidation
should not trigger VMfailValid even if the vpid in the current VMCS is 0.
However, this check and the operation following it change these semantics
in nested VMX, so the check should be removed completely.

> 
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  arch/x86/kvm/vmx.c | 5 -
> >  1 file changed, 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 87acc52..af823a3 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7394,11 +7394,6 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
> >  
> > switch (type) {
> > case VMX_VPID_EXTENT_ALL_CONTEXT:
> > -   if (get_vmcs12(vcpu)->virtual_processor_id == 0) {
> > -   nested_vmx_failValid(vcpu,
> > -   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> > -   return 1;
> > -   }
> > __vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02);
> > nested_vmx_succeed(vcpu);
> > break;
> 
> I also noticed a BUG() here in the default. It might be a good idea to replace
> it with a WARN.

Or, in nested_vmx_setup_ctls_msrs():

if (enable_vpid)
-   vmx->nested.nested_vmx_vpid_caps = VMX_VPID_INVVPID_BIT |
-   VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT;
+   vmx->nested.nested_vmx_vpid_caps = VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT;

because the current handle_invvpid() only handles all-contexts invalidation.
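
For reference, the BUG()-to-WARN change suggested above could look
roughly like this in handle_invvpid()'s default case (an untested
editorial sketch, not a posted patch):

	default:
		/* The VPID caps exposed to L1 should make this case
		 * unreachable; warn and fail the instruction instead
		 * of taking the host down with BUG(). */
		WARN_ON_ONCE(1);
		nested_vmx_failValid(vcpu,
				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
		return 1;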

Haozhong



[PATCH] KVM: nVMX: remove incorrect vpid check in nested invvpid emulation

2015-11-25 Thread Haozhong Zhang
This patch removes the vpid check when emulating nested invvpid
instruction of type all-contexts invalidation. The existing code is
incorrect because:
 (1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
 Translations Based on VPID", invvpid instruction does not check
 vpid in the invvpid descriptor when its type is all-contexts
 invalidation.
 (2) According to the same document, invvpid of type all-contexts
 invalidation does not require that there be an active VMCS, so
 get_vmcs12() in the existing code may result in a NULL-pointer
 dereference. In practice, it can crash both KVM itself and L1
 hypervisors that use invvpid (e.g. Xen).

Signed-off-by: Haozhong Zhang 
---
 arch/x86/kvm/vmx.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 87acc52..af823a3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7394,11 +7394,6 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
 
switch (type) {
case VMX_VPID_EXTENT_ALL_CONTEXT:
-   if (get_vmcs12(vcpu)->virtual_processor_id == 0) {
-   nested_vmx_failValid(vcpu,
-   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
-   return 1;
-   }
__vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02);
nested_vmx_succeed(vcpu);
break;
-- 
2.4.8



Re: [PATCH v2 00/12] KVM: x86: add support for VMX TSC scaling

2015-11-08 Thread haozhong . zhang
On 11/06/15 21:40, Paolo Bonzini wrote:
> 
> 
> On 06/11/2015 13:42, Haozhong Zhang wrote:
> > On 11/06/15 11:49, Paolo Bonzini wrote:
> >>
> >>
> >> On 20/10/2015 09:39, Haozhong Zhang wrote:
> >>> This patchset adds support for VMX TSC scaling feature which is
> >>> available on Intel Skylake CPU. The specification of VMX TSC scaling
> >>> can be found at
> >>> http://www.intel.com/content/www/us/en/processors/timestamp-counter-scaling-virtualization-white-paper.html
> >>>
> >>> VMX TSC scaling allows the guest TSC, which is read by guest rdtsc(p)
> >>> instructions, to increase at a rate that is customized by the hypervisor
> >>> and can be different from the host TSC rate. Basically, VMX TSC
> >>> scaling adds a 64-bit field called TSC multiplier in VMCS so that, if
> >>> VMX TSC scaling is enabled, TSC read by guest rdtsc(p) instructions
> >>> will be calculated by the following formula:
> >>>
> >>>   guest EDX:EAX = (Host TSC * TSC multiplier) >> 48 + VMX TSC Offset
> >>>
> >>> where, Host TSC = Host MSR_IA32_TSC + Host MSR_IA32_TSC_ADJUST.
> >>>
> >>> This patchset, when cooperating with another QEMU patchset (sent in
> >>> another email "target-i386: save/restore vcpu's TSC rate during
> >>> migration"), allows guest programs observe a consistent TSC rate even
> >>> though they are migrated among machines with different host TSC rates.
> >>>
> >>> VMX TSC scaling shares some common logic with SVM TSC ratio, which
> >>> is already supported by KVM. Patches 1 ~ 8 move that common logic from
> >>> SVM code to the common code. On top of them, patches 9 ~ 12 add VMX-specific
> >>> support for VMX TSC scaling.
> >>>
> >>> Changes in v2:
> >>>  * Remove the duplicated variable 'kvm_tsc_scaling_ratio_rsvd'.
> >>>  * Remove an unnecessary error check in original patch 2.
> >>>  * Do 64-bit arithmetic by functions recommended by Paolo.
> >>>  * Make kvm_set_tsc_khz() returns an error number so that ioctl
> >>>KVM_SET_TSC_KHZ does not return 0 if errors happen.
> >>>
> >>> Reviewed-by: Eric Northup 
> >>
> >> Thanks for the patches.  There are a couple changes that I can do myself:
> >>
> >> 1) kvm_default_tsc_scaling_ratio can be initialized in
> >> kvm_arch_hardware_setup, since it's just 1ULL <<
> >> kvm_tsc_scaling_ratio_frac_bits
> >>
> > Agree
> > 
> >> 2) things that you are adding to include/linux/kvm_host.h should instead
> >> go in arch/x86/include/linux/kvm_host.h
> >>
> > Agree, because they are x86 specific.
> > 
> >> That's it.  I'll commit it as soon as I test on AMD (today hopefully).
> 
> It tested fine.  I'll give it a shot with the 32-bit mul_u64_u64_shr on
> Monday as well, but I don't expect any issue.
> 
> Thanks, the patches are neat!
> 
> Paolo

Thank you for the test!

Haozhong



Re: [PATCH v2 00/12] KVM: x86: add support for VMX TSC scaling

2015-11-06 Thread Haozhong Zhang
On 11/06/15 11:49, Paolo Bonzini wrote:
> 
> 
> On 20/10/2015 09:39, Haozhong Zhang wrote:
> > This patchset adds support for VMX TSC scaling feature which is
> > available on Intel Skylake CPU. The specification of VMX TSC scaling
> > can be found at
> > http://www.intel.com/content/www/us/en/processors/timestamp-counter-scaling-virtualization-white-paper.html
> > 
> > VMX TSC scaling allows the guest TSC, which is read by guest rdtsc(p)
> > instructions, to increase at a rate that is customized by the hypervisor
> > and can be different from the host TSC rate. Basically, VMX TSC
> > scaling adds a 64-bit field called TSC multiplier in VMCS so that, if
> > VMX TSC scaling is enabled, TSC read by guest rdtsc(p) instructions
> > will be calculated by the following formula:
> > 
> >   guest EDX:EAX = (Host TSC * TSC multiplier) >> 48 + VMX TSC Offset
> > 
> > where, Host TSC = Host MSR_IA32_TSC + Host MSR_IA32_TSC_ADJUST.
> > 
> > This patchset, when cooperating with another QEMU patchset (sent in
> > another email "target-i386: save/restore vcpu's TSC rate during
> > migration"), allows guest programs observe a consistent TSC rate even
> > though they are migrated among machines with different host TSC rates.
> > 
> > VMX TSC scaling shares some common logic with SVM TSC ratio, which
> > is already supported by KVM. Patches 1 ~ 8 move that common logic from
> > SVM code to the common code. On top of them, patches 9 ~ 12 add VMX-specific
> > support for VMX TSC scaling.
> > 
> > Changes in v2:
> >  * Remove the duplicated variable 'kvm_tsc_scaling_ratio_rsvd'.
> >  * Remove an unnecessary error check in original patch 2.
> >  * Do 64-bit arithmetic by functions recommended by Paolo.
> >  * Make kvm_set_tsc_khz() returns an error number so that ioctl
> >KVM_SET_TSC_KHZ does not return 0 if errors happen.
> > 
> > Reviewed-by: Eric Northup 
> 
> Thanks for the patches.  There are a couple changes that I can do myself:
> 
> 1) kvm_default_tsc_scaling_ratio can be initialized in
> kvm_arch_hardware_setup, since it's just 1ULL <<
> kvm_tsc_scaling_ratio_frac_bits
>
Agree

> 2) things that you are adding to include/linux/kvm_host.h should instead
> go in arch/x86/include/linux/kvm_host.h
>
Agree, because they are x86 specific.

> That's it.  I'll commit it as soon as I test on AMD (today hopefully).
>

Thanks,
Haozhong

> Paolo
> 
> > Haozhong Zhang (12):
> >   KVM: x86: Collect information for setting TSC scaling ratio
> >   KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
> >   KVM: x86: Add a common TSC scaling function
> >   KVM: x86: Replace call-back set_tsc_khz() with a common function
> >   KVM: x86: Replace call-back compute_tsc_offset() with a common function
> >   KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
> >   KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
> >   KVM: x86: Use the correct vcpu's TSC rate to compute time scale
> >   KVM: VMX: Enable and initialize VMX TSC scaling
> >   KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
> >   KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
> >   KVM: VMX: Dump TSC multiplier in dump_vmcs()
> > 
> >  arch/x86/include/asm/kvm_host.h |  24 +++
> >  arch/x86/include/asm/vmx.h  |   3 +
> >  arch/x86/kvm/lapic.c|   4 +-
> >  arch/x86/kvm/svm.c  | 116 --
> >  arch/x86/kvm/vmx.c  |  64 ++-
> >  arch/x86/kvm/x86.c  | 134 +++-
> >  include/linux/kvm_host.h|  20 ++
> >  include/linux/math64.h  |  99 +
> >  8 files changed, 297 insertions(+), 167 deletions(-)
> > 


Re: [PATCH v2 00/12] KVM: x86: add support for VMX TSC scaling

2015-10-23 Thread Haozhong Zhang
On Fri, Oct 23, 2015 at 02:51:06PM +0200, Paolo Bonzini wrote:
> 
> 
> On 23/10/2015 14:46, Joerg Roedel wrote:
> >> > No, since I don't have AMD machines at hand. The modifications to the
> >> > SVM code mostly lift code shared with the VMX TSC scaling code into
> >> > common code, so it should still work on AMD machines.
> > Well, I think it would be good if you can provide a Tested-by on AMD
> > machines from someone who has one. Or get one yourself when changing AMD
> > specific code, they are not that expensive :)
> > I can do some testing when I am back from my travels, but that will not
> > be before early November.
> 
> I have one now (mine, not just Red Hat's). :D
> 
> Paolo

Hi Paolo,

I just posted the test instructions. It would be very appreciated if
you can help to test this patchset on AMD machines (two are required).

Thanks,
Haozhong



Re: [PATCH v2 00/12] KVM: x86: add support for VMX TSC scaling

2015-10-23 Thread Haozhong Zhang
The following is how I test this patchset. It should also apply to AMD
machines by replacing Intel with AMD and VMX TSC scaling with SVM TSC
ratio.

* Hardware Requirements
  1) Two machines with Intel CPUs, called M_A and M_B below.
  2) TSC frequency of CPUs on M_A is different from CPUs on M_B.
 Suppose TSC frequency on M_A is f_a KHz.
  3) At a minimum, the CPUs on M_B must support VMX TSC scaling.

* Software Requirements
  1) Apply this patchset to KVM on both machines.
  2) Apply QEMU patches[1] to QEMU commit 40fe17b on both machines

* Test Process
  1) Start a linux guest on M_A
   qemu-system-x86_64 -enable-kvm -smp 4 -cpu qemu64 -m 512 -hda linux.img
   
  2) In guest linux, check the TSC frequency detected by Linux kernel.
 e.g. search in dmesg for messages like
   "tsc: Detected XYZ.ABC MHz processor" or
   "tsc: Refined TSC clocksource calibration: XYZ.ABC MHz"
  
  3) Start QEMU waiting for migration on M_B:
   qemu-system-x86_64 -enable-kvm -smp 4 -cpu qemu64,load-tsc-freq -m 512 -hda linux.img -incoming tcp:0:1234
   
  4) Migrate the above VM to M_B as normal in the QEMU monitor:
   migrate tcp::1234
   
  5) After the migration, if VMX TSC scaling and this patchset work on
 M_B, no messages like
   "Clocksource tsc unstable (delta = x ns)"
 should appear in dmesg of guest linux

  6) Furthermore, users can also check whether the guest TSC after the
     migration increases at the same rate as before by running the
     attached program test_tsc in the VM:
       ./test_tsc N f_a
     It measures the number of TSC ticks that pass in N seconds, and
     divides it by the expected TSC frequency f_a to get the output
     result. If this patchset works, the output should be very close
     to N (see the example invocation after the attached program).
  
[1] http://www.spinics.net/lists/kvm/msg122421.html

Thanks,
Haozhong
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
}

int main(int argc, char **argv)
{
        uint64_t tsc0, tsc1;
        int ns, tsc_khz;
        double delta;

        if (argc < 3) {
                printf("Usage: %s <seconds> <tsc_khz>\n", argv[0]);
                return -1;
        }

        if ((ns = atoi(argv[1])) <= 0)
                return -1;
        if ((tsc_khz = atoi(argv[2])) <= 0)
                return -1;

        tsc0 = rdtsc();
        sleep(ns);
        tsc1 = rdtsc();

        delta = tsc1 - tsc0;
        printf("Passed %lf s\n", delta / (tsc_khz * 1000.0));

        return 0;
}
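
As an illustration (hypothetical numbers; only the "Passed %lf s" output
format comes from the program above), a successful run against an
expected TSC frequency of 2400000 KHz might look like:

  $ ./test_tsc 10 2400000
  Passed 10.000012 s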


Re: [PATCH v2 00/12] KVM: x86: add support for VMX TSC scaling

2015-10-23 Thread Haozhong Zhang
On Fri, Oct 23, 2015 at 02:46:19PM +0200, Joerg Roedel wrote:
> On Fri, Oct 23, 2015 at 08:32:28PM +0800, Haozhong Zhang wrote:
> > No, since I don't have AMD machines at hand. The modifications to the
> > SVM code mostly lift code shared with the VMX TSC scaling code into
> > common code, so it should still work on AMD machines.
> 
> Well, I think it would be good if you can provide a Tested-by on AMD
> machines from someone who has one. Or get one yourself when changing AMD
> specific code, they are not that expensive :)
> I can do some testing when I am back from my travels, but that will not
> be before early November.
> 
>   Joerg

I'll try to get a test result. And it would be very appreciated if you
could test as well.

Thanks!
Haozhong



Re: [PATCH v2 00/12] KVM: x86: add support for VMX TSC scaling

2015-10-23 Thread Haozhong Zhang
On Fri, Oct 23, 2015 at 12:06:50PM +0200, Joerg Roedel wrote:
> On Tue, Oct 20, 2015 at 03:39:00PM +0800, Haozhong Zhang wrote:
> > VMX TSC scaling shares some common logic with SVM TSC ratio, which
> > is already supported by KVM. Patches 1 ~ 8 move that common logic from
> > SVM code to the common code. On top of them, patches 9 ~ 12 add VMX-specific
> > support for VMX TSC scaling.
> 
> Have you tested your changes on an AMD machine too?
> 
> 
>   Joerg
> 

No, since I don't have AMD machines at hand. The modifications to the
SVM code mostly lift code shared with the VMX TSC scaling code into
common code, so it should still work on AMD machines.

Haozhong


[PATCH v2 01/12] KVM: x86: Collect information for setting TSC scaling ratio

2015-10-20 Thread Haozhong Zhang
The number of bits in the fractional part of the 64-bit TSC scaling
ratio differs between VMX and SVM. This patch makes the architecture
code collect the number of fractional bits and other related
information into variables that can be accessed from the common code.
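
As a side note, the review of this series (quoted earlier in this
thread) pointed out that, with these variables in place, the default
(1.0) ratio can be derived once in common code, e.g. in
kvm_arch_hardware_setup(), instead of being set per vendor module; a
minimal sketch of that follow-up:

        /* 1.0 in fixed point: SVM uses 32 fractional bits, so this equals
         * TSC_RATIO_DEFAULT (1ULL << 32); VMX uses 48, giving 1ULL << 48. */
        kvm_default_tsc_scaling_ratio = 1ULL << kvm_tsc_scaling_ratio_frac_bits;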

Signed-off-by: Haozhong Zhang 
---
 arch/x86/include/asm/kvm_host.h | 6 ++
 arch/x86/kvm/svm.c  | 4 
 arch/x86/kvm/x86.c  | 6 ++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 53deb27..0540dc8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -990,6 +990,12 @@ extern bool kvm_has_tsc_control;
 extern u32  kvm_min_guest_tsc_khz;
 /* maximum supported tsc_khz for guests */
 extern u32  kvm_max_guest_tsc_khz;
+/* number of bits of the fractional part of the TSC scaling ratio */
+extern u8   kvm_tsc_scaling_ratio_frac_bits;
+/* default TSC scaling ratio (= 1.0) */
+extern u64  kvm_default_tsc_scaling_ratio;
+/* maximum allowed value of TSC scaling ratio */
+extern u64  kvm_max_tsc_scaling_ratio;
 
 enum emulation_result {
EMULATE_DONE, /* no further processing */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cd8659c..55f5f49 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -908,7 +908,11 @@ static __init int svm_hardware_setup(void)
max = min(0x7fffULL, __scale_tsc(tsc_khz, TSC_RATIO_MAX));
 
kvm_max_guest_tsc_khz = max;
+
+   kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX;
+   kvm_tsc_scaling_ratio_frac_bits = 32;
}
+   kvm_default_tsc_scaling_ratio = TSC_RATIO_DEFAULT;
 
if (nested) {
printk(KERN_INFO "kvm: Nested Virtualization enabled\n");
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9e9c226..79cbbb5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -109,6 +109,12 @@ bool kvm_has_tsc_control;
 EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
 u32  kvm_max_guest_tsc_khz;
 EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz);
+u8   kvm_tsc_scaling_ratio_frac_bits;
+EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_frac_bits);
+u64  kvm_default_tsc_scaling_ratio;
+EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio);
+u64  kvm_max_tsc_scaling_ratio;
+EXPORT_SYMBOL_GPL(kvm_max_tsc_scaling_ratio);
 
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 tsc_tolerance_ppm = 250;
-- 
2.4.8



[PATCH v2 11/12] KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC

2015-10-20 Thread Haozhong Zhang
This patch makes kvm-intel return a scaled host TSC plus the TSC
offset when handling guest reads of MSR_IA32_TSC.
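
For reference, the scaled read below (kvm_scale_tsc() plus the VMCS TSC
offset) corresponds to the fixed-point arithmetic in this standalone
sketch; it is illustrative only, since the series itself does the
multiply with the mul_u64_u64_shr() helper it adds to
include/linux/math64.h, and unsigned __int128 is a GCC/Clang extension:

#include <stdint.h>

/* guest_tsc = (host_tsc * multiplier) >> 48 + tsc_offset, where the VMX
 * TSC multiplier is a fixed-point value with 48 fractional bits. */
static uint64_t vmx_guest_tsc(uint64_t host_tsc, uint64_t multiplier,
                              uint64_t tsc_offset)
{
        unsigned __int128 product = (unsigned __int128)host_tsc * multiplier;

        return (uint64_t)(product >> 48) + tsc_offset;
}

/* A multiplier of 1ULL << 48 is 1.0 (guest runs at the host TSC rate);
 * 3ULL << 47 is 1.5, and so on. */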

Signed-off-by: Haozhong Zhang 
---
 arch/x86/kvm/vmx.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c241ff3..a02b59c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2372,15 +2372,16 @@ static void setup_msrs(struct vcpu_vmx *vmx)
 
 /*
  * reads and returns guest's timestamp counter "register"
- * guest_tsc = host_tsc + tsc_offset-- 21.3
+ * guest_tsc = (host_tsc * tsc multiplier) >> 48 + tsc_offset
+ * -- Intel TSC Scaling for Virtualization White Paper, sec 1.3
  */
-static u64 guest_read_tsc(void)
+static u64 guest_read_tsc(struct kvm_vcpu *vcpu)
 {
u64 host_tsc, tsc_offset;
 
host_tsc = rdtsc();
tsc_offset = vmcs_read64(TSC_OFFSET);
-   return host_tsc + tsc_offset;
+   return kvm_scale_tsc(vcpu, host_tsc) + tsc_offset;
 }
 
 /*
@@ -2772,7 +2773,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_EFER:
return kvm_get_msr_common(vcpu, msr_info);
case MSR_IA32_TSC:
-   msr_info->data = guest_read_tsc();
+   msr_info->data = guest_read_tsc(vcpu);
break;
case MSR_IA32_SYSENTER_CS:
msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
-- 
2.4.8



  1   2   >