Re: [PATCH 0/8] s390/kvm fixes
On Sun, 19 May 2013 11:49:43 +0300 Gleb Natapov wrote:
> Hi Christian,
>
> On Fri, May 17, 2013 at 02:41:30PM +0200, Christian Borntraeger wrote:
> > Gleb, Paolo, Marcelo,
> >
> > here are some low level changes to kvm on s390 that we have been
> > cooking for a while now.
> >
> > Patch "s390/pgtable: fix ipte notify bit" will go via Martin's
> > tree into 3.10, but is included to reduce the amount of merge
> > conflicts.
> >
> > Patch "s390: fix gmap_ipte_notifier vs. software dirty pages"
> > will also go via Martin's tree into 3.10 and it fixes a hang with
> > heavy host paging and KVM. This is optional for merging, but
> > makes testing on kvm/next easier.
>
> Can I add Martin's ACKs to those two then?

Yes.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
Re: [PATCH] virtio-net: Reporting traffic queue distribution statistics through ethtool
On 05/21/2013 09:26 AM, Narasimhan, Sriram wrote:
>
> -----Original Message-----
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Monday, May 20, 2013 2:59 AM
> To: Narasimhan, Sriram
> Cc: ru...@rustcorp.com.au; virtualizat...@lists.linux-foundation.org;
> kvm@vger.kernel.org; net...@vger.kernel.org; linux-ker...@vger.kernel.org;
> Jason Wang
> Subject: Re: [PATCH] virtio-net: Reporting traffic queue distribution
> statistics through ethtool
>
> On Sun, May 19, 2013 at 10:56:16PM +0000, Narasimhan, Sriram wrote:
>> Hi Michael,
>>
>> Comments inline...
>>
>> -----Original Message-----
>> From: Michael S. Tsirkin [mailto:m...@redhat.com]
>> Sent: Sunday, May 19, 2013 1:03 PM
>> To: Narasimhan, Sriram
>> Cc: ru...@rustcorp.com.au; virtualizat...@lists.linux-foundation.org;
>> kvm@vger.kernel.org; net...@vger.kernel.org; linux-ker...@vger.kernel.org;
>> Jason Wang
>> Subject: Re: [PATCH] virtio-net: Reporting traffic queue distribution
>> statistics through ethtool
>>
>> On Sun, May 19, 2013 at 04:09:48PM +0000, Narasimhan, Sriram wrote:
>>> Hi Michael,
>>>
>>> I was getting all packets on the same inbound queue which
>>> is why I added this support to virtio-net (and some more
>>> instrumentation at tun as well).  But, it turned out to be
>>> my misconfiguration - I did not enable IFF_MULTI_QUEUE on
>>> the tap device, so the real_num_tx_queues on tap netdev was
>>> always 1 (no tx distribution at tap).
>> Interesting that qemu didn't fail.
>>
>> [Sriram] void tun_set_real_num_tx_queues() does not return
>> the EINVAL return from netif_set_real_num_tx_queues() for
>> txq > dev->num_tx_queues (which would be the case if the
>> tap device were not created with IFF_MULTI_QUEUE).  I think
>> it would be better to fix the code to disable the new
>> queue and fail tun_attach()
> You mean fail TUNSETQUEUE?
>
> [Sriram] No I meant TUNSETIFF. FYI, I am using QEMU 1.4.50.
> I created the tap device using tunctl.  This does not
> specify the IFF_MULTI_QUEUE flag during create so 1 netdev
> queue is allocated.  But, when the tun device is closed,
> tun_detach decrements tun->numqueues from 1 to 0.
>
> The following were the options I passed to qemu:
> -netdev tap,id=hostnet1,vhost=on,ifname=tap1,queues=4
> -device virtio-net-pci,netdev=hostnet1,id=net1,
> mac=52:54:00:9b:8e:24,mq=on,vectors=9,ctrl_vq=on
>
>
>> in this scenario.  If you
>> agree, I can go ahead and create a separate patch for that.
> Hmm I not sure I understand what happens, so hard for me to tell.
>
> I think this code was supposed to handle it:
>         err = -EBUSY;
>         if (!(tun->flags & TUN_TAP_MQ) && tun->numqueues == 1)
>                 goto out;
>
> why doesn't it?
>
> [Sriram] This question was raised by Jason as well.
> When QEMU is started with multiple queues on the tap
> device, it calls TUNSETIFF on the existing tap device with
> IFF_MULTI_QUEUE set.  The above code falls through since
> tun->numqueues = 0 due to the previous tun_detach() during
> close.  At the end of this, tun_set_iff() sets TUN_TAP_MQ
> flag for the tun data structure.  From that point onwards,
> subsequent TUNSETIFF will fall through resulting in the
> mismatch in #queues between tun and netdev structures.
>

Thanks, I think I get it. The problem is that we only allocate a
one-queue netdevice when IFF_MULTI_QUEUE was not set, so resetting the
multiqueue flag on a persistent device should be forbidden. Looks like
we need the following patch. Could you please test this?
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index f042b03..d4fc2bd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1585,6 +1585,10 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		else
 			return -EINVAL;
 
+		if (((ifr->ifr_flags & IFF_MULTI_QUEUE) == IFF_MULTI_QUEUE) ^
+		    ((tun->flags & TUN_TAP_MQ) == TUN_TAP_MQ))
+			return -EINVAL;
+
 		if (tun_not_capable(tun))
 			return -EPERM;
 		err = security_tun_dev_open(tun->security);

>>> I am thinking about
>>> adding a -q option to tunctl to specify multi-queue flag on
>>> the tap device.
>> Absolutely.
>>
>> [Sriram] OK, let me do that.
>
> [Sriram] I am planning to add ip tuntap multi-queue option
> instead of tunctl.
>
> Sriram
>
>>> Yes, number of exits will be most useful.  I will look into
>>> adding the other statistics you mention.
>>>
>>> Sriram
>> Pls note you'll need to switch to virtqueue_kick_prepare
>> to detect exits: virtqueue_kick doesn't let you know
>> whether there was an exit.
>>
>> Also, it's best to make this a separate patch from the one
>> adding per-queue stats.
>>
>> [Sriram] OK, I will cover only the per-queue statistics in
>> this patch.  Also, I will address the indentation/data
>> structure name points that you mentioned in your earlier
>> email and send a new revision for this patch.
>>
>> Sriram
>>
>>> -Origi
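For context, here is a minimal sketch of what a multi-queue-aware tap
creation tool (the -q tunctl option or the ip tuntap variant discussed
above) would have to do; the helper name and error handling are
illustrative only:

	/* Hypothetical sketch: create a persistent multi-queue tap device.
	 * Assumes a kernel built with IFF_MULTI_QUEUE support. */
	#include <fcntl.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/if.h>
	#include <linux/if_tun.h>

	static int create_mq_tap(const char *name)
	{
		struct ifreq ifr;
		int fd = open("/dev/net/tun", O_RDWR);

		if (fd < 0)
			return -1;
		memset(&ifr, 0, sizeof(ifr));
		strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
		ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
		if (ioctl(fd, TUNSETIFF, &ifr) < 0 ||
		    ioctl(fd, TUNSETPERSIST, 1) < 0) /* keep device after close */
			return -1;
		return fd;
	}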
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On 05/21/2013 04:40 AM, Marcelo Tosatti wrote:
> On Mon, May 20, 2013 at 11:15:45PM +0300, Gleb Natapov wrote:
>> On Mon, May 20, 2013 at 04:46:24PM -0300, Marcelo Tosatti wrote:
>>> On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote:
>>>> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
>>>> walk and zap all shadow pages one by one, and it also needs to zap all
>>>> guest page's rmap and all shadow page's parent spte list. Particularly,
>>>> things become worse if guest uses more memory or vcpus. It is not good
>>>> for scalability.
>>>>
>>>> In this patch, we introduce a faster way to invalidate all shadow pages.
>>>> KVM maintains a global mmu invalid generation-number which is stored in
>>>> kvm->arch.mmu_valid_gen and every shadow page stores the current global
>>>> generation-number into sp->mmu_valid_gen when it is created.
>>>>
>>>> When KVM needs to zap all shadow pages' sptes, it just simply increases
>>>> the global generation-number then reloads root shadow pages on all vcpus.
>>>> A vcpu will create a new shadow page table according to the current kvm's
>>>> generation-number. It ensures the old pages are not used any more.
>>>> Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
>>>> are zapped by using a lock-break technique.
>>>>
>>>> Signed-off-by: Xiao Guangrong
>>>> ---
>>>>  arch/x86/include/asm/kvm_host.h |    2 +
>>>>  arch/x86/kvm/mmu.c              |  103 +++
>>>>  arch/x86/kvm/mmu.h              |    1 +
>>>>  3 files changed, 106 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index 3741c65..bff7d46 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -222,6 +222,7 @@ struct kvm_mmu_page {
>>>>  	int root_count;          /* Currently serving as active root */
>>>>  	unsigned int unsync_children;
>>>>  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
>>>> +	unsigned long mmu_valid_gen;
>>>>  	DECLARE_BITMAP(unsync_child_bitmap, 512);
>>>>
>>>>  #ifdef CONFIG_X86_32
>>>> @@ -529,6 +530,7 @@ struct kvm_arch {
>>>>  	unsigned int n_requested_mmu_pages;
>>>>  	unsigned int n_max_mmu_pages;
>>>>  	unsigned int indirect_shadow_pages;
>>>> +	unsigned long mmu_valid_gen;
>>>>  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>>>>  	/*
>>>>  	 * Hash table of struct kvm_mmu_page.
>>>>
>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>> index 682ecb4..891ad2c 100644
>>>> --- a/arch/x86/kvm/mmu.c
>>>> +++ b/arch/x86/kvm/mmu.c
>>>> @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>>>>  	__clear_sp_write_flooding_count(sp);
>>>>  }
>>>>
>>>> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>>>> +{
>>>> +	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
>>>> +}
>>>> +
>>>>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>  					     gfn_t gfn,
>>>>  					     gva_t gaddr,
>>>> @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>  		role.quadrant = quadrant;
>>>>  	}
>>>>  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
>>>> +		if (is_obsolete_sp(vcpu->kvm, sp))
>>>> +			continue;
>>>> +
>>>
>>> Whats the purpose of not using pages which are considered "obsolete" ?
>>>
>> The same as not using page that is invalid, to not reuse stale
>> information. The page may contain ptes that point to invalid slot.
>
> Any pages with stale information will be zapped by kvm_mmu_zap_all().
> When that happens, page faults will take place which will automatically
> use the new generation number.

kvm_mmu_zap_all() uses a lock-break technique to zap pages; before it
zaps all obsolete pages, other vcpus can acquire mmu-lock and call
kvm_mmu_get_page() to install new pages. In this case, obsolete pages
still live in the hash table and can be found by kvm_mmu_get_page().

>
> So still not clear why is this necessary.
>
>>> The only purpose of the generation number should be to guarantee forward
>>> progress of kvm_mmu_zap_all in face of allocators (kvm_mmu_get_page).
>>>
>>> That is, on allocation you store the generation number on the shadow page.
>>> The only purpose of that number in the shadow page is so that
>>> kvm_mmu_zap_all can skip deleting shadow pages which have been created
>>> after kvm_mmu_zap_all began operation.
>>>
>>>>  		if (!need_sync && sp->unsync)
>>>>  			need_sync = true;
>>>> @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>>>>  			account_shadowed(vcpu->kvm, gfn);
>>>>  		}
>>>>
>>>> +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>>>>  	init_shadow_page_table(sp);
>>>>  	trace_kvm_mmu_get_page(sp, true);
>>>>  	return sp;
>>>> @@ -2071,8
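To restate the mechanism under discussion, a stripped-down sketch (names
mirror the patch, but this is not the kernel implementation): bumping the
global generation implicitly invalidates every existing page, and lookups
skip pages carrying a stale generation until the lock-break zapper reaps
them.

	/* Illustrative sketch only - not the actual KVM code. */
	struct shadow_page {
		unsigned long mmu_valid_gen;	/* generation at creation time */
		/* ... */
	};

	static unsigned long global_mmu_valid_gen;

	static int is_obsolete(struct shadow_page *sp)
	{
		return sp->mmu_valid_gen != global_mmu_valid_gen;
	}

	static void invalidate_all_shadow_pages(void)
	{
		global_mmu_valid_gen++;	/* all existing pages are now obsolete */
		/* reload roots on all vcpus; obsolete pages are zapped later,
		 * periodically dropping the lock (lock-break) */
	}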
Re: [PATCH 0/5 v2] VFIO PPC64: add VFIO support on POWERPC64
Oops, wrong subject (cut-n-paste) :) There are 3 patches, not 5.

On 05/21/2013 01:33 PM, Alexey Kardashevskiy wrote:
> The series adds support for VFIO on POWERPC in user space (such as QEMU).
> The in-kernel real mode IOMMU support is added by another series posted
> separately.
>
> As the first and main aim of this series is the POWERNV platform support,
> the "Enable on POWERNV platform" patch goes first and introduces an API
> to be used by the VFIO IOMMU driver. The "Enable on pSeries platform" patch
> simply registers PHBs in the IOMMU subsystem and expects the API to be
> present, it enables VFIO support in fully emulated QEMU guests.
>
> The main change is that this series was changed and tested against v3.10-rc1.
> It also contains some bugfixes which are mentioned (if any) in the patch
> messages.
>
> Alexey Kardashevskiy (3):
>   powerpc/vfio: Enable on POWERNV platform
>   powerpc/vfio: Implement IOMMU driver for VFIO
>   powerpc/vfio: Enable on pSeries platform

--
Alexey
[PATCH 3/3] powerpc/vfio: Enable on pSeries platform
This enables VFIO on the pSeries platform, allowing user space programs
to access PCI devices directly.

Signed-off-by: Alexey Kardashevskiy
Cc: David Gibson
Signed-off-by: Paul Mackerras
---
 arch/powerpc/platforms/pseries/iommu.c |    4
 drivers/iommu/Kconfig                  |    2 +-
 drivers/vfio/Kconfig                   |    2 +-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 86ae364..23fc1dc 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -614,6 +614,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 		iommu_table_setparms(pci->phb, dn, tbl);
 		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
+		iommu_register_group(tbl, pci_domain_nr(bus), 0);
 
 		/* Divide the rest (1.75GB) among the children */
 		pci->phb->dma_window_size = 0x8000ul;
@@ -658,6 +659,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 					   ppci->phb->node);
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
 		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
+		iommu_register_group(tbl, pci_domain_nr(bus), 0);
 		pr_debug("  created table: %p\n", ppci->iommu_table);
 	}
 }
@@ -684,6 +686,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 					   phb->node);
 		iommu_table_setparms(phb, dn, tbl);
 		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
+		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base(&dev->dev, PCI_DN(dn)->iommu_table);
 		return;
 	}
@@ -1184,6 +1187,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 				  pci->phb->node);
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
 		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
+		iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
 		pr_debug("  created table: %p\n", pci->iommu_table);
 	} else {
 		pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 3f3abde..01730b2 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -263,7 +263,7 @@ config SHMOBILE_IOMMU_L1SIZE
 
 config SPAPR_TCE_IOMMU
 	bool "sPAPR TCE IOMMU Support"
-	depends on PPC_POWERNV
+	depends on PPC_POWERNV || PPC_PSERIES
 	select IOMMU_API
 	help
 	  Enables bits of IOMMU API required by VFIO. The iommu_ops
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index b464687..26b3d9d 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -12,7 +12,7 @@ menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if X86
-	select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
+	select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.
-- 
1.7.10.4
[PATCH 1/3] powerpc/vfio: Enable on POWERNV platform
This initializes IOMMU groups based on the IOMMU configuration
discovered during the PCI scan on the POWERNV (POWER non virtualized)
platform. The IOMMU groups are to be used later by the VFIO driver,
which is used for PCI pass through.

It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.

iommu_put_tce_user_mode() only maps a single page; an API
for adding many mappings at once is going to be added later.

Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports TCE tables.
As the h_put_tce hypercall is received by the host kernel and processed
by QEMU (which involves calling the host kernel again), performance
is not the best - circa 220MB/s on a 10Gb ethernet network.

To enable VFIO on POWER, enable the SPAPR_TCE_IOMMU config option
and configure VFIO as required.

Cc: David Gibson
Signed-off-by: Alexey Kardashevskiy
Signed-off-by: Paul Mackerras
---
Changes:
* the PCI devices are added to groups from subsys_initcall_sync
(used to be done via bus_notifier callback registered for PCI bus)
* fixed parameters checking (the very last page in the address space
was not handled correctly)
---
 arch/powerpc/include/asm/iommu.h            |   26 +++
 arch/powerpc/kernel/iommu.c                 |  323 +++
 arch/powerpc/platforms/powernv/pci-ioda.c   |    1 +
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |    5 +-
 arch/powerpc/platforms/powernv/pci.c        |    2 +
 drivers/iommu/Kconfig                       |    8 +
 6 files changed, 364 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cbfe678..98d1422 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -76,6 +76,9 @@ struct iommu_table {
 	struct iommu_pool large_pool;
 	struct iommu_pool pools[IOMMU_NR_POOLS];
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
+#ifdef CONFIG_IOMMU_API
+	struct iommu_group *it_group;
+#endif
 };
 
 struct scatterlist;
@@ -98,6 +101,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
+extern void iommu_register_group(struct iommu_table *tbl,
+				 int pci_domain_number, unsigned long pe_num);
 
 extern int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 			struct scatterlist *sglist, int nelems,
@@ -147,5 +152,26 @@ static inline void iommu_restore(void)
 }
 #endif
 
+/* The API to support IOMMU operations for VFIO */
+extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce_value,
+		unsigned long npages);
+extern int iommu_tce_put_param_check(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce);
+extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
+		unsigned long hwaddr, enum dma_data_direction direction);
+extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
+		unsigned long entry);
+extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
+		unsigned long entry, unsigned long pages);
+extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
+		unsigned long entry, unsigned long tce);
+
+extern void iommu_flush_tce(struct iommu_table *tbl);
+extern int iommu_take_ownership(struct iommu_table *tbl);
+extern void iommu_release_ownership(struct iommu_table *tbl);
+
+extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
+
 #endif /* __KERNEL__ */
 #endif /*
_ASM_IOMMU_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index c0d0dbd..b20ff17 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -36,6 +36,8 @@ #include #include #include +#include +#include #include #include #include @@ -44,6 +46,7 @@ #include #include #include +#include #define DBG(...) @@ -724,6 +727,13 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) if (tbl->it_offset == 0) clear_bit(0, tbl->it_map); +#ifdef CONFIG_IOMMU_API + if (tbl->it_group) { + iommu_group_put(tbl->it_group); + BUG_ON(tbl->it_group); + } +#endif + /* verify that table contains no entries */ if (!bitmap_empty(tbl->it_map, tbl->it_size)) pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name); @@ -860,3 +870,316 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size, free_pages((unsigned long)vaddr, get_order(size)); } } + +#ifdef CONFIG_I
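As a rough illustration of how a consumer such as the VFIO IOMMU driver
from patch 2/3 is expected to drive this API (a hedged sketch; the helper
is hypothetical and error handling is abbreviated):

	/* Hypothetical caller of the API above: validate the request,
	 * then translate the user address, pin the page and write the
	 * TCE for the IOMMU page containing ioba. */
	static long map_one_page(struct iommu_table *tbl, unsigned long ioba,
			unsigned long tce)
	{
		long ret = iommu_tce_put_param_check(tbl, ioba, tce);

		if (ret)
			return ret;
		return iommu_put_tce_user_mode(tbl, ioba >> IOMMU_PAGE_SHIFT, tce);
	}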
[PATCH 0/5 v2] VFIO PPC64: add VFIO support on POWERPC64
The series adds support for VFIO on POWERPC in user space (such as QEMU).
The in-kernel real mode IOMMU support is added by another series posted
separately.

As the first and main aim of this series is the POWERNV platform support,
the "Enable on POWERNV platform" patch goes first and introduces an API
to be used by the VFIO IOMMU driver. The "Enable on pSeries platform" patch
simply registers PHBs in the IOMMU subsystem and expects the API to be
present, it enables VFIO support in fully emulated QEMU guests.

The main change is that this series was changed and tested against v3.10-rc1.
It also contains some bugfixes which are mentioned (if any) in the patch
messages.

Alexey Kardashevskiy (3):
  powerpc/vfio: Enable on POWERNV platform
  powerpc/vfio: Implement IOMMU driver for VFIO
  powerpc/vfio: Enable on pSeries platform

 Documentation/vfio.txt                      |   63 +
 arch/powerpc/include/asm/iommu.h            |   26 ++
 arch/powerpc/kernel/iommu.c                 |  323 +++
 arch/powerpc/platforms/powernv/pci-ioda.c   |    1 +
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |    5 +-
 arch/powerpc/platforms/powernv/pci.c        |    2 +
 arch/powerpc/platforms/pseries/iommu.c      |    4 +
 drivers/iommu/Kconfig                       |    8 +
 drivers/vfio/Kconfig                        |    6 +
 drivers/vfio/Makefile                       |    1 +
 drivers/vfio/vfio.c                         |    1 +
 drivers/vfio/vfio_iommu_spapr_tce.c         |  377 +++
 include/uapi/linux/vfio.h                   |   34 +++
 13 files changed, 850 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c

-- 
1.7.10.4
[PATCH 2/3] powerpc/vfio: Implement IOMMU driver for VFIO
VFIO implements platform independent stuff such as a PCI driver, BAR
access (via read/write on a file descriptor or direct mapping when
possible) and IRQ signaling. The platform dependent part includes
IOMMU initialization and handling.

This implements an IOMMU driver for VFIO which does mapping/unmapping
pages for the guest IO and provides information about the DMA window
(required by a POWER guest).

Cc: David Gibson
Signed-off-by: Alexey Kardashevskiy
Signed-off-by: Paul Mackerras
---
 Documentation/vfio.txt              |   63 ++
 drivers/vfio/Kconfig                |    6 +
 drivers/vfio/Makefile               |    1 +
 drivers/vfio/vfio.c                 |    1 +
 drivers/vfio/vfio_iommu_spapr_tce.c |  377 +++
 include/uapi/linux/vfio.h           |   34
 6 files changed, 482 insertions(+)
 create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 8eda363..c55533c 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -283,6 +283,69 @@ a direct pass through for VFIO_DEVICE_* ioctls.  The read/write/mmap
 interfaces implement the device region access defined by the device's
 own VFIO_DEVICE_GET_REGION_INFO ioctl.
 
+
+PPC64 sPAPR implementation note
+-------------------------------
+
+This implementation has some specifics:
+
+1) Only one IOMMU group per container is supported as an IOMMU group
+represents the minimal entity which isolation can be guaranteed for and
+groups are allocated statically, one per Partitionable Endpoint (PE)
+(PE is often a PCI domain but not always).
+
+2) The hardware supports so called DMA windows - the PCI address range
+within which DMA transfer is allowed, any attempt to access address space
+out of the window leads to the whole PE isolation.
+
+3) PPC64 guests are paravirtualized but not fully emulated. There is an API
+to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
+currently there is no way to reduce the number of calls. In order to make
+things faster, the map/unmap handling has been implemented in real mode,
+which provides excellent performance but has limitations such as the
+inability to do locked pages accounting in real time.
+
+So 3 additional ioctls have been added:
+
+	VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
+		of the DMA window on the PCI bus.
+
+	VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting
+		is done at this point. This lets the user first know what
+		the DMA window is and adjust rlimit before doing any real job.
+
+	VFIO_IOMMU_DISABLE - disables the container.
+
+
+The code flow from the example above should be slightly changed:
+
+	.....
+	/* Add the group to the container */
+	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
+
+	/* Enable the IOMMU model we want */
+	ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU)
+
+	/* Get additional sPAPR IOMMU info */
+	vfio_iommu_spapr_tce_info spapr_iommu_info;
+	ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);
+
+	if (ioctl(container, VFIO_IOMMU_ENABLE))
+		/* Cannot enable container, may be low rlimit */
+
+	/* Allocate some space and setup a DMA mapping */
+	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+
+	dma_map.size = 1024 * 1024;
+	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+	/* Check here if .iova/.size are within the DMA window from spapr_iommu_info */
+
+	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
+	.....
+ --- [1] VFIO was originally an acronym for "Virtual Function I/O" in its diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index 7cd5dec..b464687 100644 --- a/drivers/vfio/Kconfig +++ b/drivers/vfio/Kconfig @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1 depends on VFIO default n +config VFIO_IOMMU_SPAPR_TCE + tristate + depends on VFIO && SPAPR_TCE_IOMMU + default n + menuconfig VFIO tristate "VFIO Non-Privileged userspace driver framework" depends on IOMMU_API select VFIO_IOMMU_TYPE1 if X86 + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV help VFIO provides a framework for secure userspace device drivers. See Documentation/vfio.txt for more details. diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index 2398d4a..72bfabc 100644 --- a/drivers/vfio/Makefile +++ b/drivers/vfio/Makefile @@ -1,3 +1,4 @@ obj-$(CONFIG_VFIO) += vfio.o obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TC
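Putting the three new ioctls together, the container setup might look like
the sketch below (assuming the vfio_iommu_spapr_tce_info field names from
the include/uapi/linux/vfio.h change in this series):

	struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };

	ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
	ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
	/* info.dma32_window_start/.dma32_window_size bound the valid iova
	 * range; bump RLIMIT_MEMLOCK accordingly before enabling */
	if (ioctl(container, VFIO_IOMMU_ENABLE))
		/* cannot enable container, may be low rlimit */;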
[PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel fails
to handle a TCE request, it passes it to the virtual mode. If the
virtual mode handlers fail, then the request is passed to the user
mode, for example, to QEMU.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate a virtual
PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling
of IOMMU map/unmap.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Cc: David Gibson
Signed-off-by: Alexey Kardashevskiy
Signed-off-by: Paul Mackerras
---
Changes:
2013-05-20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
---
 Documentation/virtual/kvm/api.txt   |   28 +
 arch/powerpc/include/asm/kvm_host.h |    3 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    7 ++
 arch/powerpc/kvm/book3s_64_vio.c    |  198 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +-
 arch/powerpc/kvm/powerpc.c          |   12 +++
 include/uapi/linux/kvm.h            |    4 +
 8 files changed, 441 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c7c7ea..3c8e9fe 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,34 @@ calls by the guest for that service will be passed to userspace to be
 handled.
 
 
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+	__u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues a TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5.
The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 85d8f26..ac0e2fe 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + struct iommu_group *grp;/* used for IOMMU groups */ struct page *pages[0]; }; @@ -611,6 +612,8 @@ struct kvm_vcpu_arch { u64 busy_preempt; unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hall */ + unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */ + unsigned long tce_reason; /* The reason of switching to the virtmode */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index e852921b..934e01d 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( struct kvm_vcpu *vcpu, unsigned long liobn); extern long kvmppc_emulated_validate_tce(unsigned long tce); diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 0fb1a6e..cf82af4 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -319,6 +319,13 @@ struct kvm_create_spapr_tce { __u32 window_size; }; +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; +
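For illustration, the userspace side of the new ioctl could look roughly
like this (a hedged sketch in QEMU style; the variable names are made up,
and the capability should be queried first as with any optional ioctl):

	struct kvm_create_spapr_tce_iommu args = {
		.liobn = liobn,
		.iommu_id = iommu_group_id,
		.flags = 0,	/* no flags defined yet */
	};

	if (ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args) == 0)
		/* H_PUT_TCE & co for this liobn are now handled in-kernel */;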
[PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
This adds special support for huge pages (16MB). The reference counting
cannot be easily done for such pages in real mode (when MMU is off) so
we added a list of huge pages. It is populated in virtual mode and
get_page is called just once per huge page. Real mode handlers check
if the requested page is huge and in the list, then no reference
counting is done, otherwise an exit to virtual mode happens. The list
is released at KVM exit.

At the moment the fastest card available for tests uses up to 9 huge
pages so walking through this list is not very expensive. However this
can change and we may want to optimize this.

Cc: David Gibson
Signed-off-by: Alexey Kardashevskiy
Signed-off-by: Paul Mackerras
---
Changes:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when
a huge page is not in the list yet
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice)
---
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   22 +
 arch/powerpc/kvm/book3s_64_vio.c    |   88 +--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   40 ++--
 4 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
 	u64 liobn;
 	u32 window_size;
 	struct iommu_group *grp;	/* used for IOMMU groups */
+	struct list_head hugepages;	/* used for IOMMU groups */
+	spinlock_t hugepages_lock;	/* used for IOMMU groups */
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */ +struct kvmppc_iommu_hugepage { + struct list_head list; + pte_t pte; /* Huge page PTE */ + unsigned long gpa; /* Guest physical address */ + struct page *page; /* page struct of the very first subpage */ + unsigned long size; /* Huge page size (always 16MB at the moment) */ +}; + extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm, struct kvm_allocate_rma *rma); extern struct kvmppc_linear_info *kvm_alloc_rma(void); diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index ffb4698..c34d63a 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -45,6 +45,71 @@ #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) #define ERROR_ADDR ((void *)~(unsigned long)0x0) +#ifdef CONFIG_IOMMU_API +/* Adds a new huge page descriptor to the list */ +static long kvmppc_iommu_hugepage_try_add( + struct kvmppc_spapr_tce_table *tt, + pte_t pte, unsigned long hva, unsigned long gpa, + unsigned long pg_size) +{ + long ret = 0; + struct kvmppc_iommu_hugepage *hp; + struct page *p; + + spin_lock(&tt->hugepages_lock); + list_for_each_entry(hp, &tt->hugepages, list) { + if (hp->pte == pte) + goto unlock_exit; + } + + hva = hva & ~(pg_size - 1); + ret = get_user_pages_fast(hva, 1, true/*write*/, &p); + if ((ret != 1) || !p) { + ret = -EFAULT; + goto unlock_exit; + } + ret = 0; + + hp = kzalloc(sizeof(*hp), GFP_KERNEL); + if (!hp) { + ret = -ENOMEM; + goto unlock_exit; + } + + hp->page = p; + hp->pte = pte; + hp->gpa = gpa & ~(pg_size - 1); + hp->size = pg_size; + + list_add(&hp->list, &tt->hugepages); + +unlock_exit: + spin_unlock(&tt->hugepages_lock)
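The real-mode lookup described in the changelog amounts to something like
the following sketch (names follow the patch, but the helper itself is
illustrative, not part of the posted diff):

	/* Hypothetical helper: find a pinned huge page by guest physical
	 * address; a miss means the hcall is retried in virtual mode. */
	static struct kvmppc_iommu_hugepage *kvmppc_iommu_hugepage_find(
			struct kvmppc_spapr_tce_table *tt, unsigned long gpa)
	{
		struct kvmppc_iommu_hugepage *hp;

		list_for_each_entry(hp, &tt->hugepages, list) {
			if ((hp->gpa <= gpa) && (gpa < hp->gpa + hp->size))
				return hp;	/* already pinned, no refcounting */
		}
		return NULL;			/* exit to virtual mode */
	}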
[PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real mode" (processor
still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to access
the struct page count within that real mode environment as things like
the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement the page counter as the
get_user_pages API used for user mode mapping does not work in
real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: Paul Mackerras
Cc: David Gibson
Signed-off-by: Paul Mackerras
---
Changes:
2013-05-20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 ++
 arch/powerpc/mm/init_64.c                |   77 +-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index c2787bf..ba6cf9b 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -296,5 +296,80 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_ functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	if (!atomic_add_unless(&page->_count, -1, 1))
+		return -EAGAIN;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
-- 
1.7.10.4
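The intended calling pattern from a real-mode hcall handler is then
roughly as follows (hypothetical caller, paths abbreviated):

	/* Sketch of a real-mode user of the API above. */
	struct page *pg = realmode_pfn_to_page(pfn);

	if (!pg || realmode_get_page(pg))
		return H_TOO_HARD;	/* redo the hcall in virtual mode */
	/* ... update the TCE entry ... */
	if (realmode_put_page(pg))
		return H_TOO_HARD;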
[PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Cc: David Gibson
Signed-off-by: Alexey Kardashevskiy
Signed-off-by: Paul Mackerras
---
Changelog:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt       |   14 ++
 arch/powerpc/include/asm/kvm_host.h     |    2 +
 arch/powerpc/include/asm/kvm_ppc.h      |   16 +-
 arch/powerpc/kvm/book3s_64_vio.c        |  118 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  266 +++
 arch/powerpc/kvm/book3s_hv.c            |   39 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |   37 -
 arch/powerpc/kvm/powerpc.c              |    3 +
 include/uapi/linux/kvm.h                |    1 +
 10 files changed, 470 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 5f91eda..3c7c7ea 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2780,3 +2780,17 @@ Parameters: args[0] is the XICS device fd
             args[1] is the XICS CPU number (server ID) for this vcpu
 
 This capability connects the vcpu to an in-kernel XICS device.
+
+6.8 KVM_CAP_PPC_MULTITCE
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables the guest to put/remove multiple TCE entries
+per hypercall which significantly accelerates DMA operations for PPC KVM
+guests.
+
+When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
+expected to occur rather than H_PUT_TCE which supports only one TCE entry
+per call.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index af326cd..85d8f26 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; + + unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hall */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index a5287fe..e852921b 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, -unsigned long ioba, unsigned long tce); +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( + struct kvm_vcpu *vcpu, unsigned long liobn); +extern long kvmppc_emulated_validate_tce(unsigned long tce); +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, + unsigned long ioba, unsigned long tce); +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce); +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages); +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages); extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm, struct
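On the userspace side, the commit message above implies a check along
these lines before advertising the feature to the guest (QEMU-style
pseudo-code; the hypertas helper name is made up for illustration):

	if (kvm_check_extension(kvm_state, KVM_CAP_PPC_MULTITCE))
		/* only then add "hcall-multi-tce" to
		 * ibm,hypertas-functions in the device tree */
		spapr_hypertas_add(spapr, "hcall-multi-tce");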
[PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling
This accelerates IOMMU operations in real and virtual mode in the host
kernel for the KVM guest.

The first patch with multitce support is useful for emulated devices
as is. The other patches are designed for VFIO although this series
does not contain any VFIO related code as the connection between
VFIO and the new handlers is to be made in QEMU via ioctl to the KVM fd.

The series was made and tested against v3.10-rc1.

Alexey Kardashevskiy (4):
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt        |   42 +++
 arch/powerpc/include/asm/kvm_host.h      |    7 +
 arch/powerpc/include/asm/kvm_ppc.h       |   40 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |    4 +
 arch/powerpc/include/uapi/asm/kvm.h      |    7 +
 arch/powerpc/kvm/book3s_64_vio.c         |  398 -
 arch/powerpc/kvm/book3s_64_vio_hv.c      |  471 --
 arch/powerpc/kvm/book3s_hv.c             |   39 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
 arch/powerpc/kvm/powerpc.c               |   15 +
 arch/powerpc/mm/init_64.c                |   77 +++-
 include/uapi/linux/kvm.h                 |    5 +
 13 files changed, 1120 insertions(+), 28 deletions(-)

-- 
1.7.10.4
Re: [PATCH] vhost: get 2% performance improved by reducing spin_lock race in vhost_work_queue
From: Chuanyu Qin
Subject: [PATCH] get 2% or more performance improved by reducing spin_lock race in vhost_work_queue

The wake_up_process func is called inside the spin_lock/unlock section
in vhost_work_queue, but it could be done outside the spin_lock.
I have tested it with kernel 3.0.27 and a suse11-sp2 guest using iperf;
the numbers are below.

            original            |  modified
thread_num  tp(Gbps)  vhost(%)  |  tp(Gbps)  vhost(%)
1           9.59      28.82     |  9.59      27.49
8           9.61      32.92     |  9.62      26.77
64          9.58      46.48     |  9.55      38.99
256         9.6       63.7      |  9.6       52.59

Signed-off-by: Chuanyu Qin
---
 drivers/vhost/vhost.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 94dbd25..8bee109 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -146,9 +146,10 @@ static inline void vhost_work_queue(struct vhost_dev *dev,
 	if (list_empty(&work->node)) {
 		list_add_tail(&work->node, &dev->work_list);
 		work->queue_seq++;
+		spin_unlock_irqrestore(&dev->work_lock, flags);
 		wake_up_process(dev->worker);
-	}
-	spin_unlock_irqrestore(&dev->work_lock, flags);
+	} else
+		spin_unlock_irqrestore(&dev->work_lock, flags);
 }
 
 void vhost_poll_queue(struct vhost_poll *poll)
-- 
1.7.3.1.msysgit.0

> On 05/20/2013 12:22 PM, Qinchuanyu wrote:
> > The patch below is based on
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/drivers/vhost/vhost.c?id=refs/tags/next-20130517
> >
> > Signed-off-by: Chuanyu Qin
> > --- a/drivers/vhost/vhost.c 2013-05-20 11:47:05.000000000 +0800
> > +++ b/drivers/vhost/vhost.c 2013-05-20 11:48:24.000000000 +0800
> > @@ -154,9 +154,10 @@
> >         if (list_empty(&work->node)) {
> >                 list_add_tail(&work->node, &dev->work_list);
> >                 work->queue_seq++;
> > +               spin_unlock_irqrestore(&dev->work_lock, flags);
> >                 wake_up_process(dev->worker);
> > -       }
> > -       spin_unlock_irqrestore(&dev->work_lock, flags);
> > +       } else
> > +               spin_unlock_irqrestore(&dev->work_lock, flags);
> > }
> >
> > void vhost_poll_queue(struct vhost_poll *poll)
> >
> > I did the test by using iperf in 10G environment, the test num as below:
> >             original            |  modified
> > thread_num  tp(Gbps)  vhost(%)  |  tp(Gbps)  vhost(%)
> > 1           9.59      28.82     |  9.59      27.49
> > 8           9.61      32.92     |  9.62      26.77
> > 64          9.58      46.48     |  9.55      38.99
> > 256         9.6       63.7      |  9.6       52.59
> >
> > The cost of vhost reduced while the throughput is almost unchanged.
>
> Thanks, and please generate a formal patch based on
> Documentation/SubmittingPatches (put the description and perf numbers
> in the commit log). Then resubmit it to let the maintainer apply it.
RE: [PATCH] virtio-net: Reporting traffic queue distribution statistics through ethtool
-----Original Message-----
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Monday, May 20, 2013 2:59 AM
To: Narasimhan, Sriram
Cc: ru...@rustcorp.com.au; virtualizat...@lists.linux-foundation.org;
kvm@vger.kernel.org; net...@vger.kernel.org; linux-ker...@vger.kernel.org;
Jason Wang
Subject: Re: [PATCH] virtio-net: Reporting traffic queue distribution
statistics through ethtool

On Sun, May 19, 2013 at 10:56:16PM +0000, Narasimhan, Sriram wrote:
> Hi Michael,
>
> Comments inline...
>
> -----Original Message-----
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Sunday, May 19, 2013 1:03 PM
> To: Narasimhan, Sriram
> Cc: ru...@rustcorp.com.au; virtualizat...@lists.linux-foundation.org;
> kvm@vger.kernel.org; net...@vger.kernel.org; linux-ker...@vger.kernel.org;
> Jason Wang
> Subject: Re: [PATCH] virtio-net: Reporting traffic queue distribution
> statistics through ethtool
>
> On Sun, May 19, 2013 at 04:09:48PM +0000, Narasimhan, Sriram wrote:
> > Hi Michael,
> >
> > I was getting all packets on the same inbound queue which
> > is why I added this support to virtio-net (and some more
> > instrumentation at tun as well).  But, it turned out to be
> > my misconfiguration - I did not enable IFF_MULTI_QUEUE on
> > the tap device, so the real_num_tx_queues on tap netdev was
> > always 1 (no tx distribution at tap).
>
> Interesting that qemu didn't fail.
>
> [Sriram] void tun_set_real_num_tx_queues() does not return
> the EINVAL return from netif_set_real_num_tx_queues() for
> txq > dev->num_tx_queues (which would be the case if the
> tap device were not created with IFF_MULTI_QUEUE).  I think
> it would be better to fix the code to disable the new
> queue and fail tun_attach()

You mean fail TUNSETQUEUE?

[Sriram] No I meant TUNSETIFF. FYI, I am using QEMU 1.4.50.
I created the tap device using tunctl.  This does not
specify the IFF_MULTI_QUEUE flag during create so 1 netdev
queue is allocated.  But, when the tun device is closed,
tun_detach decrements tun->numqueues from 1 to 0.

The following were the options I passed to qemu:
-netdev tap,id=hostnet1,vhost=on,ifname=tap1,queues=4
-device virtio-net-pci,netdev=hostnet1,id=net1,
mac=52:54:00:9b:8e:24,mq=on,vectors=9,ctrl_vq=on

> in this scenario.  If you
> agree, I can go ahead and create a separate patch for that.

Hmm I not sure I understand what happens, so hard for me to tell.

I think this code was supposed to handle it:
        err = -EBUSY;
        if (!(tun->flags & TUN_TAP_MQ) && tun->numqueues == 1)
                goto out;

why doesn't it?

[Sriram] This question was raised by Jason as well.
When QEMU is started with multiple queues on the tap
device, it calls TUNSETIFF on the existing tap device with
IFF_MULTI_QUEUE set.  The above code falls through since
tun->numqueues = 0 due to the previous tun_detach() during
close.  At the end of this, tun_set_iff() sets TUN_TAP_MQ
flag for the tun data structure.  From that point onwards,
subsequent TUNSETIFF will fall through resulting in the
mismatch in #queues between tun and netdev structures.

> > I am thinking about
> > adding a -q option to tunctl to specify multi-queue flag on
> > the tap device.
>
> Absolutely.
>
> [Sriram] OK, let me do that.

[Sriram] I am planning to add ip tuntap multi-queue option
instead of tunctl.

Sriram

> > Yes, number of exits will be most useful.  I will look into
> > adding the other statistics you mention.
> >
> > Sriram
> Pls note you'll need to switch to virtqueue_kick_prepare
> to detect exits: virtqueue_kick doesn't let you know
> whether there was an exit.
>
> Also, it's best to make this a separate patch from the one
> adding per-queue stats.
>
> [Sriram] OK, I will cover only the per-queue statistics in
> this patch. Also, I will address the indentation/data
> structure name points that you mentioned in your earlier
> email and send a new revision for this patch.
>
> Sriram
>
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > Sent: Sunday, May 19, 2013 4:28 AM
> > To: Narasimhan, Sriram
> > Cc: ru...@rustcorp.com.au; virtualizat...@lists.linux-foundation.org;
> > kvm@vger.kernel.org; net...@vger.kernel.org; linux-ker...@vger.kernel.org
> > Subject: Re: [PATCH] virtio-net: Reporting traffic queue distribution
> > statistics through ethtool
> >
> > On Thu, May 16, 2013 at 01:24:29PM -0700, Sriram Narasimhan wrote:
> > > This patch allows virtio-net driver to report traffic distribution
> > > to inbound/outbound queues through ethtool -S.  The per_cpu
> > > virtnet_stats is split into receive and transmit stats and are
> > > maintained on a per receive_queue and send_queue basis.
> > > virtnet_stats() is modified to aggregate interface level statistics
> > > from per-queue statistics.  Sample output below:
> > >
> >
> > Thanks for the patch. The idea itself looks OK to me.
> > Ben Hutchings already sent some comments
> > so I won't repeat them. Some min
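Michael's point about virtqueue_kick_prepare() translates to a split like
the sketch below (the per-queue kick counter and syncp names are
hypothetical; virtqueue_kick() itself returns no exit information):

	/* instead of virtqueue_kick(sq->vq): */
	if (virtqueue_kick_prepare(sq->vq)) {
		u64_stats_update_begin(&sq->stats.syncp);
		sq->stats.kicks++;	/* hypothetical exit counter */
		u64_stats_update_end(&sq->stats.syncp);
		virtqueue_notify(sq->vq);
	}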
I/O port permission bit inheritance between threads
ioperm() inheritance across threads is different in KVM than when run on
physical hardware. The following program runs on physical hardware but
gets a SEGV under KVM. It appears that the I/O permission bits are not
shared between threads in the same way.

/* Original Copyright 2011, Kees Cook , License: GPLv2 */
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/io.h>

static void *beep(void *arg)
{
	unsigned char bits;

	fprintf(stderr, "waiting\n");
	sleep(1);
	fprintf(stderr, "beeping\n");

	/* turn on speaker */
	bits = inb(0x61);
	bits |= 3;
	outb(bits, 0x61);

	/* set 1000 Hz frequency */
	bits = 0xA9;
	outb(bits, 0x42);
	bits = 0x04;
	outb(bits, 0x42);

	/* listen to the beep */
	sleep(4);

	fprintf(stderr, "done\n");
	return NULL;
}

int main()
{
	pthread_t tid;
	unsigned char orig;

	if (pthread_create(&tid, NULL, &beep, NULL)) {
		perror("pthread");
		return 1;
	}

	/* gain access to speaker control port */
	if (ioperm(0x61, 0x61, 1) < 0) {
		perror("0x61");
		return 1;
	}
	orig = inb(0x61);

	/* gain access to speaker frequency port */
	if (ioperm(0x42, 0x42, 1) < 0) {
		perror("0x42");
		return 2;
	}

	fprintf(stderr, "joining\n");
	pthread_join(tid, NULL);

	/* restore speaker bits to turn off speaker */
	outb(orig, 0x61);
	fprintf(stderr, "done\n");

	return 0;
}
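A possible workaround sketch for the reported difference: request the port
permissions inside the thread that performs the I/O rather than relying on
inheritance from the main thread (illustrative only; it sidesteps rather
than explains the KVM behavior):

static void *beep(void *arg)
{
	/* hypothetical change: gain access in this thread itself */
	if (ioperm(0x61, 0x61, 1) < 0 || ioperm(0x42, 0x42, 1) < 0) {
		perror("ioperm");
		return NULL;
	}
	/* ... inb()/outb() sequence as before ... */
	return NULL;
}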
[PATCH v3 4/5] mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in kvm_arch_vcpu_ioctl_{s,g}et_regs
From: David Daney

Also we cannot set the special zero register, so force it to zero.

Signed-off-by: David Daney
---
 arch/mips/kvm/kvm_mips.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index 93da750..71a1fc1 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -677,9 +677,9 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
 	int i;
 
-	for (i = 0; i < 32; i++)
+	for (i = 1; i < ARRAY_SIZE(vcpu->arch.gprs); i++)
 		vcpu->arch.gprs[i] = regs->gpr[i];
-
+	vcpu->arch.gprs[0] = 0; /* zero is special, and cannot be set. */
 	vcpu->arch.hi = regs->hi;
 	vcpu->arch.lo = regs->lo;
 	vcpu->arch.pc = regs->pc;
@@ -691,7 +691,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
 	int i;
 
-	for (i = 0; i < 32; i++)
+	for (i = 0; i < ARRAY_SIZE(vcpu->arch.gprs); i++)
 		regs->gpr[i] = vcpu->arch.gprs[i];
 
 	regs->hi = vcpu->arch.hi;
-- 
1.7.11.7
[PATCH v3 1/5] mips/kvm: Fix ABI for use of FPU.
From: David Daney

Define a non-empty struct kvm_fpu.

Signed-off-by: David Daney
---
 arch/mips/include/asm/kvm.h | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h
index 85789ea..0e8f565 100644
--- a/arch/mips/include/asm/kvm.h
+++ b/arch/mips/include/asm/kvm.h
@@ -1,11 +1,12 @@
 /*
-* This file is subject to the terms and conditions of the GNU General Public
-* License. See the file "COPYING" in the main directory of this archive
-* for more details.
-*
-* Copyright (C) 2012 MIPS Technologies, Inc. All rights reserved.
-* Authors: Sanjay Lal
-*/
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2012 MIPS Technologies, Inc. All rights reserved.
+ * Copyright (C) 2013 Cavium, Inc.
+ * Authors: Sanjay Lal
+ */
 
 #ifndef __LINUX_KVM_MIPS_H
 #define __LINUX_KVM_MIPS_H
@@ -31,8 +32,20 @@ struct kvm_regs {
 struct kvm_sregs {
 };
 
-/* for KVM_GET_FPU and KVM_SET_FPU */
+/*
+ * for KVM_GET_FPU and KVM_SET_FPU
+ *
+ * If Status[FR] is zero (32-bit FPU), the upper 32-bits of the FPRs
+ * are zero filled.
+ */
 struct kvm_fpu {
+	__u64 fpr[32];
+	__u32 fir;
+	__u32 fccr;
+	__u32 fexr;
+	__u32 fenr;
+	__u32 fcsr;
+	__u32 pad;
 };
 
 struct kvm_debug_exit_arch {
-- 
1.7.11.7
[PATCH v3 5/5] mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_MSRS
From: David Daney Because not all 256 CP0 registers are ever implemented, we need a different method of manipulating them. Use the KVM_GET_MSRS/KVM_SET_MSRS mechanism as x86 does for its MSRs. Code related to implementing KVM_GET_MSRS/KVM_SET_MSRS is consolidated in to kvm_trap_emul.c, now unused code and definitions are removed. Signed-off-by: David Daney --- arch/mips/include/asm/kvm.h | 69 ++-- arch/mips/include/asm/kvm_host.h | 4 - arch/mips/kvm/kvm_mips.c | 92 +-- arch/mips/kvm/kvm_trap_emul.c| 330 ++- 4 files changed, 350 insertions(+), 145 deletions(-) diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h index d145ead..cae080c 100644 --- a/arch/mips/include/asm/kvm.h +++ b/arch/mips/include/asm/kvm.h @@ -13,10 +13,11 @@ #include -#define __KVM_MIPS - -#define N_MIPS_COPROC_REGS 32 -#define N_MIPS_COPROC_SEL 8 +/* + * KVM MIPS specific structures and definitions. + * + * Some parts derived from the x86 version of this file. + */ /* * for KVM_GET_REGS and KVM_SET_REGS @@ -31,12 +32,6 @@ struct kvm_regs { __u64 hi; __u64 lo; __u64 pc; - - __u32 cp0reg[N_MIPS_COPROC_REGS][N_MIPS_COPROC_SEL]; -}; - -/* for KVM_GET_SREGS and KVM_SET_SREGS */ -struct kvm_sregs { }; /* @@ -55,21 +50,67 @@ struct kvm_fpu { __u32 pad; }; + +/* + * For MIPS, we use the same APIs as x86, where 'msr' corresponds to a + * CP0 register. The index field is broken down as follows: + * + * bits[2..0] - Register 'sel' index. + * bits[7..3] - Register 'rd' index. + * bits[15..8] - Must be zero. + * bits[31..16] - 0 -> CP0 registers. + * + * Other sets registers may be added in the future. Each set would + * have its own identifier in bits[31..16]. + * + * For MSRs that are narrower than 64-bits, the value is stored in the + * low order bits of the data field, and sign extended to 64-bits. 
+ */ +#define KVM_MIPS_MSR_CP0 0 +struct kvm_msr_entry { + __u32 index; + __u32 reserved; + __u64 data; +}; + +/* for KVM_GET_MSRS and KVM_SET_MSRS */ +struct kvm_msrs { + __u32 nmsrs; /* number of msrs in entries */ + __u32 pad; + + struct kvm_msr_entry entries[0]; +}; + +/* for KVM_GET_MSR_INDEX_LIST */ +struct kvm_msr_list { + __u32 nmsrs; /* number of msrs in entries */ + __u32 indices[0]; +}; + +/* + * KVM MIPS specific structures and definitions + * + */ struct kvm_debug_exit_arch { + __u64 epc; }; /* for KVM_SET_GUEST_DEBUG */ struct kvm_guest_debug_arch { }; +/* definition of registers in kvm_run */ +struct kvm_sync_regs { +}; + +/* dummy definition */ +struct kvm_sregs { +}; + struct kvm_mips_interrupt { /* in */ __u32 cpu; __u32 irq; }; -/* definition of registers in kvm_run */ -struct kvm_sync_regs { -}; - #endif /* __LINUX_KVM_MIPS_H */ diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 143875c..4d6fa0b 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -496,10 +496,6 @@ struct kvm_mips_callbacks { uint32_t cause); int (*irq_clear) (struct kvm_vcpu *vcpu, unsigned int priority, uint32_t cause); - int (*vcpu_ioctl_get_regs) (struct kvm_vcpu *vcpu, - struct kvm_regs *regs); - int (*vcpu_ioctl_set_regs) (struct kvm_vcpu *vcpu, - struct kvm_regs *regs); }; extern struct kvm_mips_callbacks *kvm_mips_callbacks; int kvm_mips_emulation_init(struct kvm_mips_callbacks **install_callbacks); diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c index 71a1fc1..26a40e3 100644 --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -51,16 +51,6 @@ struct kvm_stats_debugfs_item debugfs_entries[] = { {NULL} }; -static int kvm_mips_reset_vcpu(struct kvm_vcpu *vcpu) -{ - int i; - for_each_possible_cpu(i) { - vcpu->arch.guest_kernel_asid[i] = 0; - vcpu->arch.guest_user_asid[i] = 0; - } - return 0; -} - gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn) { return gfn; @@ -192,12 +182,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm) } } -long -kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) -{ - return -EINVAL; -} - void kvm_arch_free_memslot(struct kvm_memory_slot *free, struct kvm_memory_slot *dont) { @@ -435,42 +419,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) return r; } - -int -kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_mips_interrupt *irq) -{ - int intr = (int)irq->irq; - struct kvm_vcpu *dvcpu = NULL; - - if (intr == 3 || intr == -3 || intr == 4 || intr == -4) - kvm_debug("%s: CPU: %d, INTR: %d\n", __func__, irq->cpu, -
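Given the index layout documented in the comment above (sel in bits [2:0],
rd in bits [7:3], register-set id in bits [31:16]), composing an index
might look like the following. The helper name is illustrative only, not
part of the patch:

#include <linux/types.h>

/* Hypothetical helper matching the documented bit layout. */
#define MIPS_MSR_INDEX(set, rd, sel) \
        (((__u32)(set) << 16) | (((rd) & 0x1f) << 3) | ((sel) & 0x7))

/* e.g. CP0 Status is register 12, select 0: */
__u32 status_idx = MIPS_MSR_INDEX(KVM_MIPS_MSR_CP0, 12, 0);  /* == 0x60 */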
Re: VFIO VGA test branches
On Sun, 2013-05-19 at 22:15 -0600, Alex Williamson wrote: > On Sun, 2013-05-19 at 17:35 +0200, Knut Omang wrote: > > On Mon, 2013-05-13 at 16:23 -0600, Alex Williamson wrote: > > > On Mon, 2013-05-13 at 22:55 +0200, Knut Omang wrote: > > > > Hi all, > > > > > > > > Perfect timing from my perspective, thanks Alex! > > > > > > > > I spent the better part of the weekend testing your branches on a new > > > > system > > > > I just put together for this purpose, results below.. > > > > > > > > On Fri, 2013-05-03 at 16:56 -0600, Alex Williamson wrote: > > > > ... > > > > > git://github.com/awilliam/linux-vfio.git vfio-vga-reset > > > > > git://github.com/awilliam/qemu-vfio.git vfio-vga-reset > > > > > > > > System setup: > > > > > > > > - Fedora 18 on > > > > - Gigabyte Z77X-UD5H motherboard > > > > - Intel Core i7 3770 (Ivy bridge w/integrated graphics) > > > > - 2 discrete graphics cards: > > > > > > > > lspci | egrep 'VGA|Audio' > > > > 00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 > > > > v2/3rd Gen Core processor Graphics Controller (rev 09) > > > > 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset > > > > Family High Definition Audio Controller (rev 04) > > > > 01:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI > > > > Caicos [Radeon HD 6450] > > > > 01:00.1 Audio device: Advanced Micro Devices [AMD] nee ATI Caicos HDMI > > > > Audio [Radeon HD 6400 Series] > > > > 02:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI > > > > Cape Verde PRO [Radeon HD 7700 Series] > > > > 02:00.1 Audio device: Advanced Micro Devices [AMD] nee ATI Cape > > > > Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series] > > > > > > > > Short summary: > > > > > > > > - Once I got past a few time consuming obstacles explained below > > > >- the graphics part of the graphics/hdmi audio passthrough seems to > > > > work perfect > > > > on both discrete graphics cards > > > > (though so far only one at at time and with some minor issues, see > > > > below) > > > >- no success with the hdmi audio yet (ideas for further > > > > investigation appreciated!) > > > > > > I've had hdmi audio working with an HD7850, but only in Windows (7) and > > > it was using legacy interrupts for some reason instead of MSI. I wonder > > > if Liux guests might work with snd_hda_intel.enable_msi=0. I'm not sure > > > what's wrong with MSI, but it seems to be new with the PCI bus reset > > > support. > > > > In my first tries, Windows were just using a generic > > VGA driver, which still seems to work perfect with reboots and everything > > and in full screen resolution (1920x1200). > > However after installing the Catalyst AMD driver stack, upon boot > > Windows 7 now consequently get a BSOD from the graphics driver > > with the message: > > > > "Attempt to reset the display driver and recover from timeout failed" > > - a picture of the BSOD screen attached. > > I've seen that BSOD before, but I don't know how to reproduce it. It > seems like I haven't seen it with the PCI bus reset code. I'm running > version 13.1 of the catalyst driver, you? I first tried with the install CD that came with the card - v.13-045 then upgraded to the latest from AMD, catalyst v.13.4 which appears to be driver v.12.104 - similar behaviour for both. This was with a plain Windows 7 install from my SP1 DVD. 
With most recommended windows updates and the latest catalyst driver, the BSOD is gone but instead I see the initial VGA boot screen and the windows logo, then syncs but no display and then reboot into recovery mode. (If I try all updates, Windows seems never to be able to recover from the last reboot) I have tried without kvm and also with vnc or spice graphics in addition but in those cases it seems Windows is not able to allocate MMIO resources for both adapters so I haven't been able to test the catalyst driver as a secondary windows display. > > I attach the corresponding vfio log where I added some timing code to > > make it easier to see when the BSOD happens (with 2 seconds of silence > > in the log before the VM reboots, I believe this is at 09:28:32-34 in > > the log. > > Yep, looks like that's where windows starts the BSOD. > > > Similar behaviour both just after reboot/power cycle of the host and > > subsequent VM boot attempts. > > > > This is still with the HD7700 as passed through device, but after a > > motherboard firmware upgrade (to F14) which did not seem to affect the > > observed behaviour on Windows prior to Catalyst install or with Linux > > guest, neither did it fix the bug in selecting primary devices as I > > was hoping for. > > > > Let me know if you have ideas for further debugging this, > > I don't have any great ideas since I don't know how to reproduce the > timeout. Double/triple check that you're using the correct > vfio-vga-reset branches in both qemu and kernel > > # grep VFIO_DEVICE_PCI_BUS_RESET qemu.gi
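For anyone wanting to try Alex's snd_hda_intel.enable_msi=0 suggestion
from earlier in the thread, it can go either on the guest kernel command
line or, if the guest builds the driver as a module, in modprobe
configuration (the file name below is just an example):

# guest kernel command line
snd_hda_intel.enable_msi=0

# or, e.g., in /etc/modprobe.d/hda-intel.conf
options snd-hda-intel enable_msi=0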
[PATCH v3 2/5] mips/kvm: Fix ABI for use of 64-bit registers.
From: David Daney

All registers are 64-bits wide; 32-bit guests use the least significant
portion of the register storage fields.

Signed-off-by: David Daney
---
 arch/mips/include/asm/kvm.h | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h
index 0e8f565..86812fb 100644
--- a/arch/mips/include/asm/kvm.h
+++ b/arch/mips/include/asm/kvm.h
@@ -18,12 +18,18 @@
 #define N_MIPS_COPROC_REGS	32
 #define N_MIPS_COPROC_SEL	8
 
-/* for KVM_GET_REGS and KVM_SET_REGS */
+/*
+ * for KVM_GET_REGS and KVM_SET_REGS
+ *
+ * If Config[AT] is zero (32-bit CPU), the register contents are
+ * stored in the lower 32-bits of the struct kvm_regs fields and sign
+ * extended to 64-bits.
+ */
 struct kvm_regs {
-	__u32 gprs[32];
-	__u32 hi;
-	__u32 lo;
-	__u32 pc;
+	__u64 gprs[32];
+	__u64 hi;
+	__u64 lo;
+	__u64 pc;
 
 	__u32 cp0reg[N_MIPS_COPROC_REGS][N_MIPS_COPROC_SEL];
 };
-- 
1.7.11.7
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
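The "stored in the lower 32-bits ... and sign extended" convention means a
32-bit guest value must be widened as a signed quantity before it lands in
the __u64 field. In C that is a cast chain through the signed 32-bit type
(a sketch, using plain types for self-containedness):

unsigned int val32 = 0x80001234;                /* bit 31 set */
unsigned long long field =
        (unsigned long long)(long long)(int)val32;
/* field == 0xffffffff80001234: bit 31 is replicated upward, as the
 * ABI comment requires for 32-bit (Config[AT]=0) guests. */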
[PATCH v3 0/5] mips/kvm: Fix ABI for compatibility with 64-bit guests.
From: David Daney The initial patch set implementing MIPS KVM does not handle 64-bit guests or use of the FPU. This patch set corrects these ABI issues, and does some very minor clean up. Changes from v2: Split into five parts, no code change. David Daney (5): mips/kvm: Fix ABI for use of FPU. mips/kvm: Fix ABI for use of 64-bit registers. mips/kvm: Fix name of gpr field in struct kvm_regs. mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in kvm_arch_vcpu_ioctl_{s,g}et_regs mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_MSRS arch/mips/include/asm/kvm.h | 111 ++--- arch/mips/include/asm/kvm_host.h | 4 - arch/mips/kvm/kvm_mips.c | 102 +--- arch/mips/kvm/kvm_trap_emul.c| 330 ++- 4 files changed, 386 insertions(+), 161 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 3/5] mips/kvm: Fix name of gpr field in struct kvm_regs.
From: David Daney Signed-off-by: David Daney --- arch/mips/include/asm/kvm.h | 3 ++- arch/mips/kvm/kvm_mips.c| 4 ++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h index 86812fb..d145ead 100644 --- a/arch/mips/include/asm/kvm.h +++ b/arch/mips/include/asm/kvm.h @@ -26,7 +26,8 @@ * extended to 64-bits. */ struct kvm_regs { - __u64 gprs[32]; + /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */ + __u64 gpr[32]; __u64 hi; __u64 lo; __u64 pc; diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c index e0dad02..93da750 100644 --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -678,7 +678,7 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) int i; for (i = 0; i < 32; i++) - vcpu->arch.gprs[i] = regs->gprs[i]; + vcpu->arch.gprs[i] = regs->gpr[i]; vcpu->arch.hi = regs->hi; vcpu->arch.lo = regs->lo; @@ -692,7 +692,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) int i; for (i = 0; i < 32; i++) - regs->gprs[i] = vcpu->arch.gprs[i]; + regs->gpr[i] = vcpu->arch.gprs[i]; regs->hi = vcpu->arch.hi; regs->lo = vcpu->arch.lo; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] VFIO VGA test branches
On Sun, 2013-05-19 at 23:26 +0400, Maik Broemme wrote: > Hi Knut, > > Knut Omang wrote: > > > > On Mon, 2013-05-13 at 16:23 -0600, Alex Williamson wrote: > > > On Mon, 2013-05-13 at 22:55 +0200, Knut Omang wrote: > > > > Hi all, > > > > > > > > Perfect timing from my perspective, thanks Alex! > > > > > > > > I spent the better part of the weekend testing your branches on a new > > > > system > > > > I just put together for this purpose, results below.. > > > > > > > > On Fri, 2013-05-03 at 16:56 -0600, Alex Williamson wrote: > > > > ... > > > > > git://github.com/awilliam/linux-vfio.git vfio-vga-reset > > > > > git://github.com/awilliam/qemu-vfio.git vfio-vga-reset > > > > > > > > System setup: > > > > > > > > - Fedora 18 on > > > > - Gigabyte Z77X-UD5H motherboard > > > > - Intel Core i7 3770 (Ivy bridge w/integrated graphics) > > > > - 2 discrete graphics cards: > > > > > > > > lspci | egrep 'VGA|Audio' > > > > 00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 > > > > v2/3rd Gen Core processor Graphics Controller (rev 09) > > > > 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset > > > > Family High Definition Audio Controller (rev 04) > > > > 01:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI > > > > Caicos [Radeon HD 6450] > > > > 01:00.1 Audio device: Advanced Micro Devices [AMD] nee ATI Caicos HDMI > > > > Audio [Radeon HD 6400 Series] > > > > 02:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI > > > > Cape Verde PRO [Radeon HD 7700 Series] > > > > 02:00.1 Audio device: Advanced Micro Devices [AMD] nee ATI Cape > > > > Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series] > > > > > > > > Short summary: > > > > > > > > - Once I got past a few time consuming obstacles explained below > > > >- the graphics part of the graphics/hdmi audio passthrough seems to > > > > work perfect > > > > on both discrete graphics cards > > > > (though so far only one at at time and with some minor issues, see > > > > below) > > > >- no success with the hdmi audio yet (ideas for further > > > > investigation appreciated!) > > > > > > I've had hdmi audio working with an HD7850, but only in Windows (7) and > > > it was using legacy interrupts for some reason instead of MSI. I wonder > > > if Liux guests might work with snd_hda_intel.enable_msi=0. I'm not sure > > > what's wrong with MSI, but it seems to be new with the PCI bus reset > > > support. > > > > In my first tries, Windows were just using a generic > > VGA driver, which still seems to work perfect with reboots and everything > > and in full screen resolution (1920x1200). > > However after installing the Catalyst AMD driver stack, upon boot > > Windows 7 now consequently get a BSOD from the graphics driver > > with the message: > > > > "Attempt to reset the display driver and recover from timeout failed" > > - a picture of the BSOD screen attached. > > > > I attach the corresponding vfio log where I added some timing code to > > make it easier to see when the BSOD happens (with 2 seconds of silence > > in the log before the VM reboots, I believe this is at 09:28:32-34 in > > the log. > > > > Similar behaviour both just after reboot/power cycle of the host and > > subsequent VM boot attempts. 
> > > > This is still with the HD7700 as passed through device, but after a > > motherboard firmware upgrade (to F14) which did not seem to affect the > > observed behaviour on Windows prior to Catalyst install or with Linux > > guest, neither did it fix the bug in selecting primary devices as I > > was hoping for. > > > > Let me know if you have ideas for further debugging this, > > > > I had a similar problem a couple of days ago and posted it in this list. > I got similar BSOD and tested already the following configurations: > > 1) machine: q35 / kvm: yes / vga: none / x-vga: on = qemu freeze > 2) machine: q35 / kvm: no / vga: none / x-vga: on = qemu freeze >(with errors below) > 3) machine: q35 / kvm: yes / vga: none / x-vga: off = qemu runs with >100% CPU due to no VGA init (no picture) > 4) machine: q35 / kvm: yes / vga: cirrus / x-vga: off = qemu runs with >BOSD on loading atikmpag.sys > 5) machine: pc / kvm: yes / vga: cirrus / x-vga: off = qemu runs fine > > However I've re-run the BSOD case already with the following branches > from Alex: > > git://github.com/awilliam/linux-vfio.git vfio-vga-reset > git://github.com/awilliam/qemu-vfio.git vfio-vga-reset > > Also with latest seabios and it worked so far. No more BSOD and reboot > of VM was also possible without suspend / resume the host between. Hmm, seems you have been more lucky with the choice of chipset/CPU than me - all my tests are also with these branches but on Intel... Knut > > Thanks, > > > > Knut > > > > > > - Contrary to de...@lavabit.com I had no success with using pci-assign > > > > for VGA > > > > with a standard fedora 18 kernel and fairly re
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On Mon, May 20, 2013 at 11:15:45PM +0300, Gleb Natapov wrote: > On Mon, May 20, 2013 at 04:46:24PM -0300, Marcelo Tosatti wrote: > > On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote: > > > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to > > > walk and zap all shadow pages one by one, also it need to zap all guest > > > page's rmap and all shadow page's parent spte list. Particularly, things > > > become worse if guest uses more memory or vcpus. It is not good for > > > scalability > > > > > > In this patch, we introduce a faster way to invalidate all shadow pages. > > > KVM maintains a global mmu invalid generation-number which is stored in > > > kvm->arch.mmu_valid_gen and every shadow page stores the current global > > > generation-number into sp->mmu_valid_gen when it is created > > > > > > When KVM need zap all shadow pages sptes, it just simply increase the > > > global generation-number then reload root shadow pages on all vcpus. > > > Vcpu will create a new shadow page table according to current kvm's > > > generation-number. It ensures the old pages are not used any more. > > > Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) > > > are zapped by using lock-break technique > > > > > > Signed-off-by: Xiao Guangrong > > > --- > > > arch/x86/include/asm/kvm_host.h |2 + > > > arch/x86/kvm/mmu.c | 103 > > > +++ > > > arch/x86/kvm/mmu.h |1 + > > > 3 files changed, 106 insertions(+), 0 deletions(-) > > > > > > diff --git a/arch/x86/include/asm/kvm_host.h > > > b/arch/x86/include/asm/kvm_host.h > > > index 3741c65..bff7d46 100644 > > > --- a/arch/x86/include/asm/kvm_host.h > > > +++ b/arch/x86/include/asm/kvm_host.h > > > @@ -222,6 +222,7 @@ struct kvm_mmu_page { > > > int root_count; /* Currently serving as active root */ > > > unsigned int unsync_children; > > > unsigned long parent_ptes; /* Reverse mapping for parent_pte */ > > > + unsigned long mmu_valid_gen; > > > DECLARE_BITMAP(unsync_child_bitmap, 512); > > > > > > #ifdef CONFIG_X86_32 > > > @@ -529,6 +530,7 @@ struct kvm_arch { > > > unsigned int n_requested_mmu_pages; > > > unsigned int n_max_mmu_pages; > > > unsigned int indirect_shadow_pages; > > > + unsigned long mmu_valid_gen; > > > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; > > > /* > > >* Hash table of struct kvm_mmu_page. > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > > > index 682ecb4..891ad2c 100644 > > > --- a/arch/x86/kvm/mmu.c > > > +++ b/arch/x86/kvm/mmu.c > > > @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 > > > *spte) > > > __clear_sp_write_flooding_count(sp); > > > } > > > > > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > > > +{ > > > + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); > > > +} > > > + > > > static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, > > >gfn_t gfn, > > >gva_t gaddr, > > > @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct > > > kvm_vcpu *vcpu, > > > role.quadrant = quadrant; > > > } > > > for_each_gfn_sp(vcpu->kvm, sp, gfn) { > > > + if (is_obsolete_sp(vcpu->kvm, sp)) > > > + continue; > > > + > > > > Whats the purpose of not using pages which are considered "obsolete" ? > > > The same as not using page that is invalid, to not reuse stale > information. The page may contain ptes that point to invalid slot. Any pages with stale information will be zapped by kvm_mmu_zap_all(). 
When that happens, page faults will take place which will automatically use the new generation number. So still not clear why is this necessary. > > The only purpose of the generation number should be to guarantee forward > > progress of kvm_mmu_zap_all in face of allocators (kvm_mmu_get_page). > > > > That is, on allocation you store the generation number on the shadow page. > > The only purpose of that number in the shadow page is so that > > kvm_mmu_zap_all can skip deleting shadow pages which have been created > > after kvm_mmu_zap_all began operation. > > > > > if (!need_sync && sp->unsync) > > > need_sync = true; > > > > > > @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct > > > kvm_vcpu *vcpu, > > > > > > account_shadowed(vcpu->kvm, gfn); > > > } > > > + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; > > > init_shadow_page_table(sp); > > > trace_kvm_mmu_get_page(sp, true); > > > return sp; > > > @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm > > > *kvm, struct kvm_mmu_page *sp, > > > ret = mmu_zap_unsync_children(kvm, sp, invalid_list); > > > kvm_mmu_page_unlink_children(kvm, sp); > > > kvm_mmu_unlink_parents(kvm, sp); > > > + > > > i
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On Mon, May 20, 2013 at 04:46:24PM -0300, Marcelo Tosatti wrote: > On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote: > > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to > > walk and zap all shadow pages one by one, also it need to zap all guest > > page's rmap and all shadow page's parent spte list. Particularly, things > > become worse if guest uses more memory or vcpus. It is not good for > > scalability > > > > In this patch, we introduce a faster way to invalidate all shadow pages. > > KVM maintains a global mmu invalid generation-number which is stored in > > kvm->arch.mmu_valid_gen and every shadow page stores the current global > > generation-number into sp->mmu_valid_gen when it is created > > > > When KVM need zap all shadow pages sptes, it just simply increase the > > global generation-number then reload root shadow pages on all vcpus. > > Vcpu will create a new shadow page table according to current kvm's > > generation-number. It ensures the old pages are not used any more. > > Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) > > are zapped by using lock-break technique > > > > Signed-off-by: Xiao Guangrong > > --- > > arch/x86/include/asm/kvm_host.h |2 + > > arch/x86/kvm/mmu.c | 103 > > +++ > > arch/x86/kvm/mmu.h |1 + > > 3 files changed, 106 insertions(+), 0 deletions(-) > > > > diff --git a/arch/x86/include/asm/kvm_host.h > > b/arch/x86/include/asm/kvm_host.h > > index 3741c65..bff7d46 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -222,6 +222,7 @@ struct kvm_mmu_page { > > int root_count; /* Currently serving as active root */ > > unsigned int unsync_children; > > unsigned long parent_ptes; /* Reverse mapping for parent_pte */ > > + unsigned long mmu_valid_gen; > > DECLARE_BITMAP(unsync_child_bitmap, 512); > > > > #ifdef CONFIG_X86_32 > > @@ -529,6 +530,7 @@ struct kvm_arch { > > unsigned int n_requested_mmu_pages; > > unsigned int n_max_mmu_pages; > > unsigned int indirect_shadow_pages; > > + unsigned long mmu_valid_gen; > > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; > > /* > > * Hash table of struct kvm_mmu_page. > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > > index 682ecb4..891ad2c 100644 > > --- a/arch/x86/kvm/mmu.c > > +++ b/arch/x86/kvm/mmu.c > > @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte) > > __clear_sp_write_flooding_count(sp); > > } > > > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > > +{ > > + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); > > +} > > + > > static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, > > gfn_t gfn, > > gva_t gaddr, > > @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct > > kvm_vcpu *vcpu, > > role.quadrant = quadrant; > > } > > for_each_gfn_sp(vcpu->kvm, sp, gfn) { > > + if (is_obsolete_sp(vcpu->kvm, sp)) > > + continue; > > + > > Whats the purpose of not using pages which are considered "obsolete" ? > The same as not using page that is invalid, to not reuse stale information. The page may contain ptes that point to invalid slot. > The only purpose of the generation number should be to guarantee forward > progress of kvm_mmu_zap_all in face of allocators (kvm_mmu_get_page). > > That is, on allocation you store the generation number on the shadow page. 
> The only purpose of that number in the shadow page is so that > kvm_mmu_zap_all can skip deleting shadow pages which have been created > after kvm_mmu_zap_all began operation. > > > if (!need_sync && sp->unsync) > > need_sync = true; > > > > @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct > > kvm_vcpu *vcpu, > > > > account_shadowed(vcpu->kvm, gfn); > > } > > + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; > > init_shadow_page_table(sp); > > trace_kvm_mmu_get_page(sp, true); > > return sp; > > @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, > > struct kvm_mmu_page *sp, > > ret = mmu_zap_unsync_children(kvm, sp, invalid_list); > > kvm_mmu_page_unlink_children(kvm, sp); > > kvm_mmu_unlink_parents(kvm, sp); > > + > > if (!sp->role.invalid && !sp->role.direct) > > unaccount_shadowed(kvm, sp->gfn); > > + > > if (sp->unsync) > > kvm_unlink_unsync_page(kvm, sp); > > > > @@ -4196,6 +4207,98 @@ restart: > > spin_unlock(&kvm->mmu_lock); > > } > > > > +static void kvm_zap_obsolete_pages(struct kvm *kvm) > > +{ > > + struct kvm_mmu_page *sp, *node; > > + LIST_HEAD(invalid_list); > > + > > +restart: > > + lis
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote: > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to > walk and zap all shadow pages one by one, also it need to zap all guest > page's rmap and all shadow page's parent spte list. Particularly, things > become worse if guest uses more memory or vcpus. It is not good for > scalability > > In this patch, we introduce a faster way to invalidate all shadow pages. > KVM maintains a global mmu invalid generation-number which is stored in > kvm->arch.mmu_valid_gen and every shadow page stores the current global > generation-number into sp->mmu_valid_gen when it is created > > When KVM need zap all shadow pages sptes, it just simply increase the > global generation-number then reload root shadow pages on all vcpus. > Vcpu will create a new shadow page table according to current kvm's > generation-number. It ensures the old pages are not used any more. > Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) > are zapped by using lock-break technique > > Signed-off-by: Xiao Guangrong > --- > arch/x86/include/asm/kvm_host.h |2 + > arch/x86/kvm/mmu.c | 103 > +++ > arch/x86/kvm/mmu.h |1 + > 3 files changed, 106 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 3741c65..bff7d46 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -222,6 +222,7 @@ struct kvm_mmu_page { > int root_count; /* Currently serving as active root */ > unsigned int unsync_children; > unsigned long parent_ptes; /* Reverse mapping for parent_pte */ > + unsigned long mmu_valid_gen; > DECLARE_BITMAP(unsync_child_bitmap, 512); > > #ifdef CONFIG_X86_32 > @@ -529,6 +530,7 @@ struct kvm_arch { > unsigned int n_requested_mmu_pages; > unsigned int n_max_mmu_pages; > unsigned int indirect_shadow_pages; > + unsigned long mmu_valid_gen; > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; > /* >* Hash table of struct kvm_mmu_page. > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 682ecb4..891ad2c 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte) > __clear_sp_write_flooding_count(sp); > } > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > +{ > + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); > +} > + > static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, >gfn_t gfn, >gva_t gaddr, > @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct > kvm_vcpu *vcpu, > role.quadrant = quadrant; > } > for_each_gfn_sp(vcpu->kvm, sp, gfn) { > + if (is_obsolete_sp(vcpu->kvm, sp)) > + continue; > + Whats the purpose of not using pages which are considered "obsolete" ? The only purpose of the generation number should be to guarantee forward progress of kvm_mmu_zap_all in face of allocators (kvm_mmu_get_page). That is, on allocation you store the generation number on the shadow page. The only purpose of that number in the shadow page is so that kvm_mmu_zap_all can skip deleting shadow pages which have been created after kvm_mmu_zap_all began operation. 
> if (!need_sync && sp->unsync) > need_sync = true; > > @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct > kvm_vcpu *vcpu, > > account_shadowed(vcpu->kvm, gfn); > } > + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; > init_shadow_page_table(sp); > trace_kvm_mmu_get_page(sp, true); > return sp; > @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, > struct kvm_mmu_page *sp, > ret = mmu_zap_unsync_children(kvm, sp, invalid_list); > kvm_mmu_page_unlink_children(kvm, sp); > kvm_mmu_unlink_parents(kvm, sp); > + > if (!sp->role.invalid && !sp->role.direct) > unaccount_shadowed(kvm, sp->gfn); > + > if (sp->unsync) > kvm_unlink_unsync_page(kvm, sp); > > @@ -4196,6 +4207,98 @@ restart: > spin_unlock(&kvm->mmu_lock); > } > > +static void kvm_zap_obsolete_pages(struct kvm *kvm) > +{ > + struct kvm_mmu_page *sp, *node; > + LIST_HEAD(invalid_list); > + > +restart: > + list_for_each_entry_safe_reverse(sp, node, > + &kvm->arch.active_mmu_pages, link) { > + /* > + * No obsolete page exists before new created page since > + * active_mmu_pages is the FIFO list. > + */ > + if (!is_obsolete_sp(kvm, sp)) > + break; > + > + /
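For readers following the thread: the mechanism under discussion reduces
to a single monotonically increasing stamp. Bumping it invalidates every
existing page in O(1), and stale pages are reaped lazily. A stripped-down
sketch of the pattern -- not the kernel code itself; new_gen_page() is a
stand-in allocator:

struct gen_page {
        unsigned long mmu_valid_gen;
        /* ... payload ... */
};

static unsigned long mmu_valid_gen;

extern struct gen_page *new_gen_page(void);     /* stand-in */

static int is_obsolete(struct gen_page *p)
{
        return p->mmu_valid_gen != mmu_valid_gen;
}

static void invalidate_all(void)
{
        mmu_valid_gen++;        /* every existing page is now obsolete */
        /* roots are reloaded; obsolete pages are zapped later, dropping
         * the lock periodically (the "lock-break" technique) */
}

static struct gen_page *get_page_stamped(void)
{
        struct gen_page *p = new_gen_page();
        p->mmu_valid_gen = mmu_valid_gen;  /* stamp creation time so a
                                              concurrent zap-all can skip
                                              pages newer than itself */
        return p;
}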
Re: [PATCH 00/18] KVM/MIPS32: Support for the new Virtualization ASE (VZ-ASE)
On 05/20/2013 11:36 AM, Maciej W. Rozycki wrote:
> On Mon, 20 May 2013, Sanjay Lal wrote:
>> (1) Newer versions of the MIPS architecture define scratch registers
>> for just this purpose, but since we have to support standard MIPS32R2
>> processors, we use the DDataLo Register (CP0 Register 28, Select 3)
>> as a scratch register to save k0 and save k1 @ a known offset from
>> EBASE.
>
> That's rather risky as the implementation of this register (and its
> presence in the first place) is processor-specific. Do you maintain a
> list of PRId values the use of this register is safe with?

FWIW: The MIPS-VZ architecture module requires the presence of CP0
scratch registers that can be used for this in the exception handlers,
without having to worry about using these implementation-dependent
registers.

For the trap-and-emulate-only version, there really is no choice other
than to re-purpose some of the existing CP0 registers.

David Daney
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/18] KVM/MIPS32: Support for the new Virtualization ASE (VZ-ASE)
On Mon, 20 May 2013, Sanjay Lal wrote: > (1) Newer versions of the MIPS architecture define scratch registers for > just this purpose, but since we have to support standard MIPS32R2 > processors, we use the DDataLo Register (CP0 Register 28, Select 3) as a > scratch register to save k0 and save k1 @ a known offset from EBASE. That's rather risky as the implementation of this register (and its presence in the first place) is processor-specific. Do you maintain a list of PRId values the use of this register is safe with? Maciej -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/18] KVM/MIPS32: Support for the new Virtualization ASE (VZ-ASE)
On May 20, 2013, at 10:29 AM, David Daney wrote: > On 05/20/2013 09:58 AM, Sanjay Lal wrote: >> >> On May 20, 2013, at 8:50 AM, David Daney wrote: >> >>> On 05/18/2013 10:47 PM, Sanjay Lal wrote: The following patch set adds support for the recently announced virtualization extensions for the MIPS32 architecture and allows running unmodified kernels in Guest Mode. For more info please refer to : MIPS Document #: MD00846 Volume IV-i: Virtualization Module of the MIPS32 Architecture which can be accessed @: http://www.mips.com/auth/MD00846-2B-VZMIPS32-AFP-01.03.pdf The patch is agains Linux-3.10-rc1. KVM/MIPS now supports 2 modes of operation: (1) VZ mode: Unmodified kernels running in Guest Mode. The processor now provides an almost complete COP0 context in Guest mode. This greatly reduces VM exits. >>> >>> Two questions: >>> >>> 1) How are you handling not clobbering the Guest K0/K1 registers when a >>> Root exception occurs? It is not obvious to me from inspecting the code. >>> >>> 2) What environment are you using to test this stuff? >>> >>> David Daney >>> >> >> (1) Newer versions of the MIPS architecture define scratch registers for >> just this purpose, but since we have to support standard MIPS32R2 >> processors, we use the DDataLo Register (CP0 Register 28, Select 3) as a >> scratch register to save k0 and save k1 @ a known offset from EBASE. >> > > Right, I understand that. But I am looking at arch/mips/mm/tlbex.c, and I > don't see the code that does that for TLBRefill exceptions. > > Where is it done for interrupts? I would expect code in > arch/mips/kernel/genex.S and/or stackframe.h would handle this. But I don't > see where it is. > > Am I missing something? > > David Daney > arch/mips/kvm/kvm_locore.S -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/18] KVM/MIPS32: Support for the new Virtualization ASE (VZ-ASE)
On 05/20/2013 09:58 AM, Sanjay Lal wrote:
> On May 20, 2013, at 8:50 AM, David Daney wrote:
>> On 05/18/2013 10:47 PM, Sanjay Lal wrote:
>>> The following patch set adds support for the recently announced
>>> virtualization extensions for the MIPS32 architecture and allows
>>> running unmodified kernels in Guest Mode.
>>>
>>> For more info please refer to:
>>> MIPS Document #: MD00846
>>> Volume IV-i: Virtualization Module of the MIPS32 Architecture
>>>
>>> which can be accessed @:
>>> http://www.mips.com/auth/MD00846-2B-VZMIPS32-AFP-01.03.pdf
>>>
>>> The patch is against Linux-3.10-rc1.
>>>
>>> KVM/MIPS now supports 2 modes of operation:
>>>
>>> (1) VZ mode: Unmodified kernels running in Guest Mode. The processor
>>> now provides an almost complete COP0 context in Guest mode. This
>>> greatly reduces VM exits.
>>
>> Two questions:
>>
>> 1) How are you handling not clobbering the Guest K0/K1 registers when
>> a Root exception occurs? It is not obvious to me from inspecting the
>> code.
>>
>> 2) What environment are you using to test this stuff?
>>
>> David Daney
>
> (1) Newer versions of the MIPS architecture define scratch registers
> for just this purpose, but since we have to support standard MIPS32R2
> processors, we use the DDataLo Register (CP0 Register 28, Select 3)
> as a scratch register to save k0 and save k1 @ a known offset from
> EBASE.

Right, I understand that. But I am looking at arch/mips/mm/tlbex.c, and
I don't see the code that does that for TLB Refill exceptions.

Where is it done for interrupts? I would expect code in
arch/mips/kernel/genex.S and/or stackframe.h would handle this. But I
don't see where it is.

Am I missing something?

David Daney
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/18] KVM/MIPS32: Support for the new Virtualization ASE (VZ-ASE)
On May 20, 2013, at 8:50 AM, David Daney wrote: > On 05/18/2013 10:47 PM, Sanjay Lal wrote: >> The following patch set adds support for the recently announced >> virtualization >> extensions for the MIPS32 architecture and allows running unmodified kernels >> in >> Guest Mode. >> >> For more info please refer to : >> MIPS Document #: MD00846 >> Volume IV-i: Virtualization Module of the MIPS32 Architecture >> >> which can be accessed @: >> http://www.mips.com/auth/MD00846-2B-VZMIPS32-AFP-01.03.pdf >> >> The patch is agains Linux-3.10-rc1. >> >> KVM/MIPS now supports 2 modes of operation: >> >> (1) VZ mode: Unmodified kernels running in Guest Mode. The processor now >> provides >> an almost complete COP0 context in Guest mode. This greatly reduces VM >> exits. > > Two questions: > > 1) How are you handling not clobbering the Guest K0/K1 registers when a Root > exception occurs? It is not obvious to me from inspecting the code. > > 2) What environment are you using to test this stuff? > > David Daney > (1) Newer versions of the MIPS architecture define scratch registers for just this purpose, but since we have to support standard MIPS32R2 processors, we use the DDataLo Register (CP0 Register 28, Select 3) as a scratch register to save k0 and save k1 @ a known offset from EBASE. (2) Platforms that we've tested on: KVM Trap & Emulate - Malta Board with FPGA based 34K - Sigma Designs TangoX board with a 24K based 8654 SoC. - Malta Board with 74K @ 1GHz - QEMU (as of 1.4.90) - Imperas M*SDK MIPS32 simulator KVM MIPS/VZ - Imperas M*SDK MIPS32 simulator + MIPS/VZ model. Regards Sanjay -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
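For the curious: "CP0 Register 28, Select 3" is the DDataLo register, and
the save/restore Sanjay describes boils down to an mtc0/mfc0 pair. The
real code is MIPS assembly in arch/mips/kvm/kvm_locore.S; the sketch below
leans on the ddatalo accessors I believe asm/mipsregs.h provides (wrappers
around mtc0/mfc0 $28, sel 3) purely to illustrate the idea in C:

#include <asm/mipsregs.h>

/* Illustration only -- the actual handler must do this in assembly,
 * before k0/k1 can be clobbered by C code. */
unsigned long stash_and_recover_k0(unsigned long k0_value)
{
        write_c0_ddatalo(k0_value);     /* mtc0 k0, $28, 3 */
        /* ... exception handling that is now free to use k0/k1 ... */
        return read_c0_ddatalo();       /* mfc0 k0, $28, 3 */
}

As Maciej points out, whether DDataLo exists at all is processor-specific,
which is exactly the risk being debated above.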
Re: [PATCH v4 12/32] arm64: KVM: kvm_arch and kvm_vcpu_arch definitions
On Tue, May 14, 2013 at 03:13:40PM +0100, Marc Zyngier wrote: > Provide the architecture dependent structures for VM and > vcpu abstractions. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 11/32] arm64: KVM: virtual CPU reset
On Tue, May 14, 2013 at 03:13:39PM +0100, Marc Zyngier wrote: > Provide the reset code for a virtual CPU booted in 64bit mode. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 07/32] arm64: KVM: architecture specific MMU backend
On 20/05/13 17:25, Catalin Marinas wrote: > On Mon, May 20, 2013 at 05:17:31PM +0100, Marc Zyngier wrote: >> On 20/05/13 16:57, Catalin Marinas wrote: >>> On Tue, May 14, 2013 at 03:13:35PM +0100, Marc Zyngier wrote: +static inline bool kvm_is_write_fault(unsigned long esr) +{ + unsigned long esr_ec = esr >> ESR_EL2_EC_SHIFT; >>> >>> Not that it would make much difference but for consistency - we use esr >>> as an 'unsigned int' in the arm64 code (only 32-bit needed). >> >> Indeed. Changing it would require a 32bit patch though, as I'd like to >> keep the two code bases in line. >> >> If you don't mind, I'll queue a patch changing this to "unsigned int" on >> both architectures once this is merged. > > I'm OK if you push the patch afterwards, you can add my reviewed-by on > this patch. Sure, will do. > Reviewed-by: Catalin Marinas Thanks, M. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 10/32] arm64: KVM: CPU specific system registers handling
On Tue, May 14, 2013 at 03:13:38PM +0100, Marc Zyngier wrote: > Add the support code for CPU specific system registers. Not much > here yet. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 07/32] arm64: KVM: architecture specific MMU backend
On Mon, May 20, 2013 at 05:17:31PM +0100, Marc Zyngier wrote: > On 20/05/13 16:57, Catalin Marinas wrote: > > On Tue, May 14, 2013 at 03:13:35PM +0100, Marc Zyngier wrote: > >> +static inline bool kvm_is_write_fault(unsigned long esr) > >> +{ > >> + unsigned long esr_ec = esr >> ESR_EL2_EC_SHIFT; > > > > Not that it would make much difference but for consistency - we use esr > > as an 'unsigned int' in the arm64 code (only 32-bit needed). > > Indeed. Changing it would require a 32bit patch though, as I'd like to > keep the two code bases in line. > > If you don't mind, I'll queue a patch changing this to "unsigned int" on > both architectures once this is merged. I'm OK if you push the patch afterwards, you can add my reviewed-by on this patch. Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 09/32] arm64: KVM: system register handling
On Tue, May 14, 2013 at 03:13:37PM +0100, Marc Zyngier wrote: > Provide 64bit system register handling, modeled after the cp15 > handling for ARM. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 07/32] arm64: KVM: architecture specific MMU backend
On 20/05/13 16:57, Catalin Marinas wrote: > On Tue, May 14, 2013 at 03:13:35PM +0100, Marc Zyngier wrote: >> +static inline bool kvm_is_write_fault(unsigned long esr) >> +{ >> +unsigned long esr_ec = esr >> ESR_EL2_EC_SHIFT; > > Not that it would make much difference but for consistency - we use esr > as an 'unsigned int' in the arm64 code (only 32-bit needed). Indeed. Changing it would require a 32bit patch though, as I'd like to keep the two code bases in line. If you don't mind, I'll queue a patch changing this to "unsigned int" on both architectures once this is merged. Thanks, M. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 08/32] arm64: KVM: user space interface
On Tue, May 14, 2013 at 03:13:36PM +0100, Marc Zyngier wrote: > Provide the kvm.h file that defines the user space visible > interface. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 07/32] arm64: KVM: architecture specific MMU backend
On Tue, May 14, 2013 at 03:13:35PM +0100, Marc Zyngier wrote: > +static inline bool kvm_is_write_fault(unsigned long esr) > +{ > + unsigned long esr_ec = esr >> ESR_EL2_EC_SHIFT; Not that it would make much difference but for consistency - we use esr as an 'unsigned int' in the arm64 code (only 32-bit needed). -- Catalin -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 06/32] arm64: KVM: fault injection into a guest
On Tue, May 14, 2013 at 03:13:34PM +0100, Marc Zyngier wrote: > Implement the injection of a fault (undefined, data abort or > prefetch abort) into a 64bit guest. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/18] KVM/MIPS32: Support for the new Virtualization ASE (VZ-ASE)
On 05/18/2013 10:47 PM, Sanjay Lal wrote: The following patch set adds support for the recently announced virtualization extensions for the MIPS32 architecture and allows running unmodified kernels in Guest Mode. For more info please refer to : MIPS Document #: MD00846 Volume IV-i: Virtualization Module of the MIPS32 Architecture which can be accessed @: http://www.mips.com/auth/MD00846-2B-VZMIPS32-AFP-01.03.pdf The patch is agains Linux-3.10-rc1. KVM/MIPS now supports 2 modes of operation: (1) VZ mode: Unmodified kernels running in Guest Mode. The processor now provides an almost complete COP0 context in Guest mode. This greatly reduces VM exits. Two questions: 1) How are you handling not clobbering the Guest K0/K1 registers when a Root exception occurs? It is not obvious to me from inspecting the code. 2) What environment are you using to test this stuff? David Daney (2) Trap and Emulate: Runs minimally modified guest kernels in UM and uses binary patching to minimize the number of traps and improve performance. This is used for processors that do not support the VZ-ASE. -- Sanjay Lal (18): Revert "MIPS: microMIPS: Support dynamic ASID sizing." Revert "MIPS: Allow ASID size to be determined at boot time." KVM/MIPS32: Export min_low_pfn. KVM/MIPS32-VZ: MIPS VZ-ASE related register defines and helper macros. KVM/MIPS32-VZ: VZ-ASE assembler wrapper functions to set GuestIDs KVM/MIPS32-VZ: VZ-ASE related callbacks to handle guest exceptions that trap to the Root context. KVM/MIPS32: VZ-ASE related CPU feature flags and options. KVM/MIPS32-VZ: Entry point for trampolining to the guest and trap handlers. KVM/MIPS32-VZ: Add support for CONFIG_KVM_MIPS_VZ option KVM/MIPS32-VZ: Add API for VZ-ASE Capability KVM/MIPS32-VZ: VZ: Handle Guest TLB faults that are handled in Root context KVM/MIPS32-VZ: VM Exit Stats, add VZ exit reasons. KVM/MIPS32-VZ: Top level handler for Guest faults KVM/MIPS32-VZ: Guest exception batching support. KVM/MIPS32: Add dummy trap handler to catch unexpected exceptions and dump out useful info KVM/MIPS32-VZ: Add VZ-ASE support to KVM/MIPS data structures. 
KVM/MIPS32: Revert to older method for accessing ASID parameters KVM/MIPS32-VZ: Dump out additional info about VZ features as part of /proc/cpuinfo arch/mips/include/asm/cpu-features.h | 36 ++ arch/mips/include/asm/cpu-info.h | 21 + arch/mips/include/asm/cpu.h |5 + arch/mips/include/asm/kvm_host.h | 244 ++-- arch/mips/include/asm/mipsvzregs.h | 494 +++ arch/mips/include/asm/mmu_context.h | 95 ++- arch/mips/kernel/genex.S |2 +- arch/mips/kernel/mips_ksyms.c|6 + arch/mips/kernel/proc.c | 11 + arch/mips/kernel/smtc.c | 10 +- arch/mips/kernel/traps.c |6 +- arch/mips/kvm/Kconfig| 14 +- arch/mips/kvm/Makefile | 14 +- arch/mips/kvm/kvm_locore.S | 1088 ++ arch/mips/kvm/kvm_mips.c | 73 ++- arch/mips/kvm/kvm_mips_dyntrans.c| 24 +- arch/mips/kvm/kvm_mips_emul.c| 236 arch/mips/kvm/kvm_mips_int.h |5 + arch/mips/kvm/kvm_mips_stats.c | 17 +- arch/mips/kvm/kvm_tlb.c | 444 +++--- arch/mips/kvm/kvm_trap_emul.c| 68 ++- arch/mips/kvm/kvm_vz.c | 786 arch/mips/kvm/kvm_vz_locore.S| 74 +++ arch/mips/lib/dump_tlb.c |5 +- arch/mips/lib/r3k_dump_tlb.c |7 +- arch/mips/mm/tlb-r3k.c | 20 +- arch/mips/mm/tlb-r4k.c |2 +- arch/mips/mm/tlb-r8k.c |2 +- arch/mips/mm/tlbex.c | 82 +-- include/uapi/linux/kvm.h |1 + 30 files changed, 2906 insertions(+), 986 deletions(-) create mode 100644 arch/mips/include/asm/mipsvzregs.h create mode 100644 arch/mips/kvm/kvm_vz.c create mode 100644 arch/mips/kvm/kvm_vz_locore.S -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 05/32] arm64: KVM: Basic ESR_EL2 helpers and vcpu register access
On 20/05/13 16:41, Catalin Marinas wrote: > On Tue, May 14, 2013 at 03:13:33PM +0100, Marc Zyngier wrote: >> Implements helpers for dealing with the EL2 syndrome register as >> well as accessing the vcpu registers. >> >> Reviewed-by: Christopher Covington >> Signed-off-by: Marc Zyngier > ... >> +static inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu) >> +{ >> +return !!(kvm_vcpu_get_hsr(vcpu) & ESR_EL2_ISV); >> +} > > You kept these '!!' ;). BTW, would the compiler handle the conversion > between the integer and bool here? I like them very much. The compiler handles the conversion perfectly fine, but the '!!' makes the intent explicit, and someone reviewing the code won't be tempted to change the return type to u64 or something similar. > Either way is fine by me: > > Reviewed-by: Catalin Marinas Thanks, M. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
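The idiom in question normalizes any nonzero value to exactly 1. With a
bool return type the implicit conversion already does that, so (as Marc
says) the '!!' is mainly about making the intent survive a later change of
return type. A small illustration, using a generic bit-24 flag in place of
ESR_EL2_ISV:

#include <stdbool.h>

void demo(unsigned long hsr)
{
        bool a = (hsr & (1UL << 24));   /* bool conversion already
                                           normalizes to true/false */
        int  b = !!(hsr & (1UL << 24)); /* explicitly 0 or 1 */
        int  c =   (hsr & (1UL << 24)); /* 0x1000000 when set -- and a
                                           flag above bit 31 could even
                                           truncate to 0 in an int */
        (void)a; (void)b; (void)c;
}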
Re: [PATCH v4 02/32] arm64: KVM: HYP mode idmap support
On 20/05/13 16:31, Catalin Marinas wrote: > On Tue, May 14, 2013 at 03:13:30PM +0100, Marc Zyngier wrote: >> Add the necessary infrastructure for identity-mapped HYP page >> tables. Idmap-ed code must be in the ".hyp.idmap.text" linker >> section. >> >> The rest of the HYP ends up in ".hyp.text". >> >> Signed-off-by: Marc Zyngier > ... >> +/* >> + * The HYP init code can't be more than a page long. >> + */ >> +ASSERT(((__hyp_idmap_text_start + PAGE_SIZE) >= __hyp_idmap_text_end), >> + "HYP init code too big") > > Is __hyp_idmap_text_end inclusive or exclusive? I think the latter and > you can use greater than (without equal). Indeed. I'll update the check. > Otherwise: > > Reviewed-by: Catalin Marinas Thanks, M. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
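With __hyp_idmap_text_end exclusive, the updated check Marc refers to
would presumably read:

/*
 * The HYP init code can't be more than a page long.
 */
ASSERT(((__hyp_idmap_text_start + PAGE_SIZE) > __hyp_idmap_text_end),
       "HYP init code too big")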
Re: [PATCH v4 05/32] arm64: KVM: Basic ESR_EL2 helpers and vcpu register access
On Tue, May 14, 2013 at 03:13:33PM +0100, Marc Zyngier wrote: > Implements helpers for dealing with the EL2 syndrome register as > well as accessing the vcpu registers. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier ... > +static inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu) > +{ > + return !!(kvm_vcpu_get_hsr(vcpu) & ESR_EL2_ISV); > +} You kept these '!!' ;). BTW, would the compiler handle the conversion between the integer and bool here? Either way is fine by me: Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 04/32] arm64: KVM: system register definitions for 64bit guests
On Tue, May 14, 2013 at 03:13:32PM +0100, Marc Zyngier wrote: > Define the saved/restored registers for 64bit guests. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 03/32] arm64: KVM: EL2 register definitions
On Tue, May 14, 2013 at 03:13:31PM +0100, Marc Zyngier wrote: > Define all the useful bitfields for EL2 registers. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 02/32] arm64: KVM: HYP mode idmap support
On Tue, May 14, 2013 at 03:13:30PM +0100, Marc Zyngier wrote: > Add the necessary infrastructure for identity-mapped HYP page > tables. Idmap-ed code must be in the ".hyp.idmap.text" linker > section. > > The rest of the HYP ends up in ".hyp.text". > > Signed-off-by: Marc Zyngier ... > +/* > + * The HYP init code can't be more than a page long. > + */ > +ASSERT(((__hyp_idmap_text_start + PAGE_SIZE) >= __hyp_idmap_text_end), > + "HYP init code too big") Is __hyp_idmap_text_end inclusive or exclusive? I think the latter and you can use greater than (without equal). Otherwise: Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH tip/core/rcu 2/2] powerpc,kvm: fix imbalance srcu_read_[un]lock()
From: Lai Jiangshan

At the point of the up_out label in kvmppc_hv_setup_htab_rma(), the srcu
read lock is still held. We have to release it before returning.

Signed-off-by: Lai Jiangshan
Cc: Marcelo Tosatti
Cc: Gleb Natapov
Cc: Alexander Graf
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: kvm@vger.kernel.org
Cc: kvm-...@vger.kernel.org
Signed-off-by: Paul E. McKenney
Reviewed-by: Josh Triplett
---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9de24f8..ec4c28a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1862,7 +1862,7 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu)
 
 up_out:
 	up_read(&current->mm->mmap_sem);
-	goto out;
+	goto out_srcu;
 }
 
 int kvmppc_core_init_vm(struct kvm *kvm)
-- 
1.8.1.5
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
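The invariant being restored here is the usual SRCU pairing: every
srcu_read_lock() must be matched, on every exit path, by an
srcu_read_unlock() with the index the lock returned. Schematically:

int idx;

idx = srcu_read_lock(&kvm->srcu);
/* read-side critical section; any goto or early return that leaves
 * this region must still reach the matching unlock below */
srcu_read_unlock(&kvm->srcu, idx);

which is why the up_out error path now jumps to out_srcu instead of out.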
Re: [PATCH v4 01/32] arm64: KVM: define HYP and Stage-2 translation page flags
On Tue, May 14, 2013 at 03:13:29PM +0100, Marc Zyngier wrote: > Add HYP and S2 page flags, for both normal and device memory. > > Reviewed-by: Christopher Covington > Signed-off-by: Marc Zyngier Reviewed-by: Catalin Marinas -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call agenda for 2013-05-21
On Mon, May 20, 2013 at 12:57:47PM +0200, Juan Quintela wrote: > > Hi > > Please, send any topic that you are interested in covering. > > Thanks, Juan. Generating acpi tables. Cc'd a bunch of people who might be interested in this topic. Kevin - could you join on Tuesday? There appears a disconnect between the seabios and qemu that a conf call might help resolve. -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 08/13] nEPT: Some additional comments
Il 19/05/2013 06:52, Jun Nakajima ha scritto: > From: Nadav Har'El > > Some additional comments to preexisting code: > Explain who (L0 or L1) handles EPT violation and misconfiguration exits. > Don't mention "shadow on either EPT or shadow" as the only two options. > > Signed-off-by: Nadav Har'El > Signed-off-by: Jun Nakajima > Signed-off-by: Xinhao Xu > --- > arch/x86/kvm/vmx.c | 13 + > 1 file changed, 13 insertions(+) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index b79efd4..4661a22 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -6540,7 +6540,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu > *vcpu) > return nested_cpu_has2(vmcs12, > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); > case EXIT_REASON_EPT_VIOLATION: > + /* > + * L0 always deals with the EPT violation. If nested EPT is > + * used, and the nested mmu code discovers that the address is > + * missing in the guest EPT table (EPT12), the EPT violation > + * will be injected with nested_ept_inject_page_fault() > + */ > + return 0; > case EXIT_REASON_EPT_MISCONFIG: > + /* > + * L2 never uses directly L1's EPT, but rather L0's own EPT > + * table (shadow on EPT) or a merged EPT table that L0 built > + * (EPT on EPT). So any problems with the structure of the > + * table is L0's fault. > + */ > return 0; > case EXIT_REASON_PREEMPTION_TIMER: > return vmcs12->pin_based_vm_exec_control & > Reviewed-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry
Il 19/05/2013 06:52, Jun Nakajima ha scritto: > From: Nadav Har'El > > The existing code for handling cr3 and related VMCS fields during nested > exit and entry wasn't correct in all cases: > > If L2 is allowed to control cr3 (and this is indeed the case in nested EPT), > during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and > we forgot to do so. This patch adds this copy. > > If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and > whoever does control cr3 (L1 or L2) is using PAE, the processor might have > saved PDPTEs and we should also save them in vmcs12 (and restore later). > > Signed-off-by: Nadav Har'El > Signed-off-by: Jun Nakajima > Signed-off-by: Xinhao Xu > --- > arch/x86/kvm/vmx.c | 30 ++ > 1 file changed, 30 insertions(+) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index a88432f..b79efd4 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -7608,6 +7608,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) > kvm_set_cr3(vcpu, vmcs12->guest_cr3); > kvm_mmu_reset_context(vcpu); > > + /* > + * Additionally, except when L0 is using shadow page tables, L1 or > + * L2 control guest_cr3 for L2, so they may also have saved PDPTEs > + */ > + if (enable_ept) { > + vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0); > + vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1); > + vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2); > + vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3); > + } > + > kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp); > kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip); > } > @@ -7930,6 +7941,25 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) > vmcs12->guest_pending_dbg_exceptions = > vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS); > > + /* > + * In some cases (usually, nested EPT), L2 is allowed to change its > + * own CR3 without exiting. If it has changed it, we must keep it. > + * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined > + * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12. > + */ > + if (enable_ept) > + vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3); > + /* > + * Additionally, except when L0 is using shadow page tables, L1 or > + * L2 control guest_cr3 for L2, so save their PDPTEs > + */ > + if (enable_ept) { > + vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0); > + vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1); > + vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2); > + vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3); > + } > + > vmcs12->vm_entry_controls = > (vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) | > (vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE); > Reviewed-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 07/13] nEPT: Fix wrong test in kvm_set_cr3
Il 19/05/2013 06:52, Jun Nakajima ha scritto: > From: Nadav Har'El > > kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical > address. The problem is that with nested EPT, cr3 is an *L2* physical > address, not an L1 physical address as this test expects. > > As the comment above this test explains, it isn't necessary, and doesn't > correspond to anything a real processor would do. So this patch removes it. > > Note that this wrong test could have also theoretically caused problems > in nested NPT, not just in nested EPT. However, in practice, the problem > was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the > nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus > circumventing the problem. Additional potential calls to the buggy function > are avoided in that we don't trap cr3 modifications when nested NPT is > enabled. However, because in nested VMX we did want to use kvm_set_cr3() > (as requested in Avi Kivity's review of the original nested VMX patches), > we can't avoid this problem and need to fix it Makes sense, but did you test what happens (without nesting, and both with/without EPT) if L1 points CR3 at an invalid physical address? Does a basic level of sanity remain? Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
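For readers without the tree at hand, the removed test is the guest-physical validity check near the top of kvm_set_cr3(), roughly as follows (paraphrased from the code of that era; treat it as illustrative rather than the exact hunk):

    /*
     * Does the new cr3 value map to physical memory? A real CPU would
     * silently accept an invalid cr3 and only fault later, so the check
     * has no architectural basis -- and with nested EPT, cr3 is an
     * L2-physical address that L1's memslots know nothing about.
     */
    if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
        return 1;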
Re: [PATCH v3 13/13] nEPT: Inject EPT violation/misconfiguration
Il 19/05/2013 06:52, Jun Nakajima ha scritto: > @@ -7441,10 +7443,81 @@ static void nested_ept_inject_page_fault(struct > kvm_vcpu *vcpu, >* Note no need to set vmcs12->vm_exit_reason as it is already copied >* from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION. >*/ This comment is now wrong. > - vmcs12->exit_qualification = fault->error_code; And this shows that patch 5 ("nEPT: MMU context for nested EPT") was wrong in this respect. Perhaps this patch should be moved earlier in the series, so that the exit qualification is "bisectably" ok. 1) the updating of exit_qualification in walk_addr_generic should be split out and moved before patch 5; 2) the changes to handle_ept_violation and nested_ept_inject_page_fault (plus fixing the above comment) should also be split out, this time to squash them in patch 5. These two changes ensure that patch 5 can already use the right exit qualification. 3) if needed to make the series bisectable, squash patch 12 into patch 2 and make is_rsvd_bits_set always return 0 in patch 3; then the rest of the handling of reserved bits (including the introduction of check_tdp_pte) will remain here. Otherwise, just squash what's left of this patch into patch 12 and again change the subject. In either case the subject will have to change. Paolo > + if (fault->error_code & PFERR_RSVD_MASK) > + vmcs12->vm_exit_reason = EXIT_REASON_EPT_MISCONFIG; > + else > + vmcs12->vm_exit_reason = EXIT_REASON_EPT_VIOLATION; > + > + vmcs12->exit_qualification = vcpu->arch.exit_qualification; > vmcs12->guest_physical_address = fault->address; > } > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 09/13] nEPT: Advertise EPT to L1
Il 19/05/2013 06:52, Jun Nakajima ha scritto: > From: Nadav Har'El > > Advertise the support of EPT to the L1 guest, through the appropriate MSR. > > This is the last patch of the basic Nested EPT feature, so as to allow > bisection through this patch series: The guest will not see EPT support until > this last patch, and will not attempt to use the half-applied feature. > > Signed-off-by: Nadav Har'El > Signed-off-by: Jun Nakajima > Signed-off-by: Xinhao Xu > --- > arch/x86/include/asm/vmx.h | 2 ++ > arch/x86/kvm/vmx.c | 17 +++-- > 2 files changed, 17 insertions(+), 2 deletions(-) This patch is ok, but it must be placed after patch 10 ("nEPT: Nested INVEPT"). Paolo > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h > index f3e01a2..4aec45d 100644 > --- a/arch/x86/include/asm/vmx.h > +++ b/arch/x86/include/asm/vmx.h > @@ -394,7 +394,9 @@ enum vmcs_field { > #define VMX_EPTP_WB_BIT (1ull << 14) > #define VMX_EPT_2MB_PAGE_BIT (1ull << 16) > #define VMX_EPT_1GB_PAGE_BIT (1ull << 17) > +#define VMX_EPT_INVEPT_BIT (1ull << 20) > #define VMX_EPT_AD_BIT (1ull << 21) > +#define VMX_EPT_EXTENT_INDIVIDUAL_BIT(1ull << 24) > #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25) > #define VMX_EPT_EXTENT_GLOBAL_BIT(1ull << 26) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 4661a22..1cf8a41 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -2155,6 +2155,7 @@ static u32 nested_vmx_pinbased_ctls_low, > nested_vmx_pinbased_ctls_high; > static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high; > static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high; > static u32 nested_vmx_misc_low, nested_vmx_misc_high; > +static u32 nested_vmx_ept_caps; > static __init void nested_vmx_setup_ctls_msrs(void) > { > /* > @@ -2242,6 +2243,18 @@ static __init void nested_vmx_setup_ctls_msrs(void) > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | > SECONDARY_EXEC_WBINVD_EXITING; > > + if (enable_ept) { > + /* nested EPT: emulate EPT also to L1 */ > + nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT; > + nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT; > + nested_vmx_ept_caps |= > + VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT | > + VMX_EPT_EXTENT_CONTEXT_BIT | > + VMX_EPT_EXTENT_INDIVIDUAL_BIT; > + nested_vmx_ept_caps &= vmx_capability.ept; > + } else > + nested_vmx_ept_caps = 0; > + > /* miscellaneous data */ > rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high); > nested_vmx_misc_low &= VMX_MISC_PREEMPTION_TIMER_RATE_MASK | > @@ -2347,8 +2360,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 > msr_index, u64 *pdata) > nested_vmx_secondary_ctls_high); > break; > case MSR_IA32_VMX_EPT_VPID_CAP: > - /* Currently, no nested ept or nested vpid */ > - *pdata = 0; > + /* Currently, no nested vpid support */ > + *pdata = nested_vmx_ept_caps; > break; > default: > return 0; > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 10/13] nEPT: Nested INVEPT
Il 19/05/2013 06:52, Jun Nakajima ha scritto: > + switch (type) { > + case VMX_EPT_EXTENT_GLOBAL: > + if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT)) > + nested_vmx_failValid(vcpu, > + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); > + else { > + /* > + * Do nothing: when L1 changes EPT12, we already > + * update EPT02 (the shadow EPT table) and call INVEPT. > + * So when L1 calls INVEPT, there's nothing left to do. > + */ > + nested_vmx_succeed(vcpu); > + } > + break; Duplicate code: switch (type) { case VMX_EPT_EXTENT_GLOBAL ok = (nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT) != 0; break; ... default: ok = false; break; } if (ok) { /* Do nothing: ... */ nested_vmx_succeed(vcpu); } else { nested_vmx_failValid(vcpu, ...); } break; Paolo > + case VMX_EPT_EXTENT_CONTEXT: > + if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT)) > + nested_vmx_failValid(vcpu, > + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); > + else { > + /* Do nothing */ > + nested_vmx_succeed(vcpu); > + } > + break; > + case VMX_EPT_EXTENT_INDIVIDUAL_ADDR: > + if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT)) > + nested_vmx_failValid(vcpu, > + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); > + else { > + /* Do nothing */ > + nested_vmx_succeed(vcpu); > + } > + break; > + default: > + nested_vmx_failValid(vcpu, > + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
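Spelled out, the deduplicated version Paolo sketches would look something like this (a compilable rendering of his sketch, not a merged patch):

    bool ok;

    switch (type) {
    case VMX_EPT_EXTENT_GLOBAL:
        ok = (nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT) != 0;
        break;
    case VMX_EPT_EXTENT_CONTEXT:
        ok = (nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT) != 0;
        break;
    case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
        ok = (nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT) != 0;
        break;
    default:
        ok = false;
        break;
    }
    if (ok) {
        /*
         * Do nothing: when L1 changes EPT12 we already update EPT02
         * (the shadow EPT table) and call INVEPT, so there is nothing
         * left to do when L1 itself executes INVEPT.
         */
        nested_vmx_succeed(vcpu);
    } else {
        nested_vmx_failValid(vcpu,
                             VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
    }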
Re: [PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page()
Il 19/05/2013 06:52, Jun Nakajima ha scritto: > From: Nadav Har'El > > Since link_shadow_page() is used by a routine in mmu.c, add an > EPT-specific link_shadow_page() in paging_tmp.h, rather than moving > it. > > Signed-off-by: Nadav Har'El > Signed-off-by: Jun Nakajima > Signed-off-by: Xinhao Xu > --- > arch/x86/kvm/paging_tmpl.h | 20 > 1 file changed, 20 insertions(+) > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h > index 4c45654..dc495f9 100644 > --- a/arch/x86/kvm/paging_tmpl.h > +++ b/arch/x86/kvm/paging_tmpl.h > @@ -461,6 +461,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, > struct guest_walker *gw, > } > } > > +#if PTTYPE == PTTYPE_EPT > +static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp) > +{ > + u64 spte; > + > + spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK | > + VMX_EPT_EXECUTABLE_MASK; > + > + mmu_spte_set(sptep, spte); > +} > +#endif The function is small enough that likely the compiler will inline it. You can just handle it unconditionally with FNAME(). Paolo > /* > * Fetch a shadow pte for a specific level in the paging hierarchy. > * If the guest tries to write a write-protected page, we need to > @@ -513,7 +525,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t > addr, > goto out_gpte_changed; > > if (sp) > +#if PTTYPE == PTTYPE_EPT > + FNAME(link_shadow_page)(it.sptep, sp); > +#else > link_shadow_page(it.sptep, sp); > +#endif > } > > for (; > @@ -533,7 +549,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t > addr, > > sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1, > true, direct_access, it.sptep); > +#if PTTYPE == PTTYPE_EPT > + FNAME(link_shadow_page)(it.sptep, sp); > +#else > link_shadow_page(it.sptep, sp); > +#endif > } > > clear_sp_write_flooding_count(it.sptep); > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
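One way to read the suggestion: define link_shadow_page() per paging flavor for every PTTYPE instantiation, so the call sites lose their preprocessor guards entirely, e.g. (sketch only, under the assumption the compiler inlines the trivial non-EPT case):

    static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
    {
    #if PTTYPE == PTTYPE_EPT
        u64 spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK |
                   VMX_EPT_WRITABLE_MASK | VMX_EPT_EXECUTABLE_MASK;

        mmu_spte_set(sptep, spte);
    #else
        link_shadow_page(sptep, sp);    /* the existing mmu.c helper */
    #endif
    }

    /* ... and in FNAME(fetch) the call becomes unconditional: */
    if (sp)
        FNAME(link_shadow_page)(it.sptep, sp);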
Re: [PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h
Il 19/05/2013 06:52, Jun Nakajima ha scritto: > From: Nadav Har'El > > For preparation, we just move gpte_access() and prefetch_invalid_gpte() from > mmu.c to paging_tmpl.h. > > Signed-off-by: Nadav Har'El > Signed-off-by: Jun Nakajima > Signed-off-by: Xinhao Xu > --- > arch/x86/kvm/mmu.c | 30 -- > arch/x86/kvm/paging_tmpl.h | 40 +++- > 2 files changed, 35 insertions(+), 35 deletions(-) > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 004cc87..117233f 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -2488,26 +2488,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu > *vcpu, gfn_t gfn, > return gfn_to_pfn_memslot_atomic(slot, gfn); > } > > -static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu, > - struct kvm_mmu_page *sp, u64 *spte, > - u64 gpte) > -{ > - if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL)) > - goto no_present; > - > - if (!is_present_gpte(gpte)) > - goto no_present; > - > - if (!(gpte & PT_ACCESSED_MASK)) > - goto no_present; > - > - return false; > - > -no_present: > - drop_spte(vcpu->kvm, spte); > - return true; > -} > - > static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu, > struct kvm_mmu_page *sp, > u64 *start, u64 *end) > @@ -3408,16 +3388,6 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, > unsigned access, > return false; > } > > -static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte) > -{ > - unsigned access; > - > - access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK; > - access &= ~(gpte >> PT64_NX_SHIFT); > - > - return access; > -} > - > static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, > unsigned gpte) > { > unsigned index; > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h > index da20860..df34d4a 100644 > --- a/arch/x86/kvm/paging_tmpl.h > +++ b/arch/x86/kvm/paging_tmpl.h > @@ -103,6 +103,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, > struct kvm_mmu *mmu, > return (ret != orig_pte); > } > > +static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu, > + struct kvm_mmu_page *sp, u64 *spte, > + u64 gpte) > +{ > + if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL)) > + goto no_present; > + > + if (!is_present_gpte(gpte)) > + goto no_present; > + > + if (!(gpte & PT_ACCESSED_MASK)) > + goto no_present; > + > + return false; > + > +no_present: > + drop_spte(vcpu->kvm, spte); > + return true; > +} > + > +static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte) > +{ > + unsigned access; > + > + access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK; > + access &= ~(gpte >> PT64_NX_SHIFT); > + > + return access; > +} > + > static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, >struct kvm_mmu *mmu, >struct guest_walker *walker, > @@ -225,7 +255,7 @@ retry_walk: > } > > accessed_dirty &= pte; > - pte_access = pt_access & gpte_access(vcpu, pte); > + pte_access = pt_access & FNAME(gpte_access)(vcpu, pte); > > walker->ptes[walker->level - 1] = pte; > } while (!is_last_gpte(mmu, walker->level, pte)); > @@ -309,13 +339,13 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct > kvm_mmu_page *sp, > gfn_t gfn; > pfn_t pfn; > > - if (prefetch_invalid_gpte(vcpu, sp, spte, gpte)) > + if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte)) > return false; > > pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte); > > gfn = gpte_to_gfn(gpte); > - pte_access = sp->role.access & gpte_access(vcpu, gpte); > + pte_access = sp->role.access & FNAME(gpte_access)(vcpu, 
gpte); > protect_clean_gpte(&pte_access, gpte); > pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn, > no_dirty_log && (pte_access & ACC_WRITE_MASK)); > @@ -782,14 +812,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, > struct kvm_mmu_page *sp) > sizeof(pt_element_t))) > return -EINVAL; > > - if (prefetch_invalid_gpte(vcpu, sp, &sp->spt[i], gpte)) { > + if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) { > vcpu->kvm->tlbs_dirty++; > continue; > } > > gfn = gpte_to_gfn(gpte); > pte_access = sp->role.access; > - pte_acc
Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
Il 19/05/2013 06:52, Jun Nakajima ha scritto: > From: Nadav Har'El > > Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577 > switch the EFER MSR when EPT is used and the host and guest have different > NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2) > and want to be able to run recent KVM as L1, we need to allow L1 to use this > EFER switching feature. > > To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, > and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds > support for the former (the latter is still unsupported). > > Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, > respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all > that's left to do in this patch is to properly advertise this feature to L1. > > Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using > vmx_set_efer (which itself sets one of several vmcs02 fields), so we always > support this feature, regardless of whether the host supports it. > > Signed-off-by: Nadav Har'El > Signed-off-by: Jun Nakajima > Signed-off-by: Xinhao Xu > --- > arch/x86/kvm/vmx.c | 23 --- > 1 file changed, 16 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 260a919..fb9cae5 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) > #else > nested_vmx_exit_ctls_high = 0; > #endif > - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; > + nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | > + VM_EXIT_LOAD_IA32_EFER); > > /* entry controls */ > rdmsr(MSR_IA32_VMX_ENTRY_CTLS, > @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) > nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; > nested_vmx_entry_ctls_high &= > VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE; > - nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; > - > + nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | > +VM_ENTRY_LOAD_IA32_EFER); > /* cpu-based controls */ > rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high); > @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) > vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask; > vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits); > > - /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */ > - vmcs_write32(VM_EXIT_CONTROLS, > - vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl); > - vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls | > + /* L2->L1 exit controls are emulated - the hardware exit is to L0 so > + * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER > + * bits are further modified by vmx_set_efer() below. > + */ > + vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); > + > + /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are > + * emulated by vmx_set_efer(), below. VM_ENTRY_LOAD_IA32_EFER is not emulated by vmx_set_efer, so: /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE * are emulated below. VM_ENTRY_IA32E_MODE is handled in * vmx_set_efer(). 
*/ Paolo > + */ > + vmcs_write32(VM_ENTRY_CONTROLS, > + (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER & > + ~VM_ENTRY_IA32E_MODE) | > (vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE)); > > if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT) > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] emulator: add MUL tests
On Sat, Feb 09, 2013 at 11:32:25AM +0200, Avi Kivity wrote: > Signed-off-by: Avi Kivity Applied, thanks. > --- > x86/emulator.c | 19 --- > 1 file changed, 16 insertions(+), 3 deletions(-) > > diff --git a/x86/emulator.c b/x86/emulator.c > index a128e13..96576e5 100644 > --- a/x86/emulator.c > +++ b/x86/emulator.c > @@ -583,9 +583,9 @@ static void test_imul(ulong *mem) > report("imul rax, mem, imm", a == 0x1D950BDE1D950BC8L); > } > > -static void test_div(long *mem) > +static void test_muldiv(long *mem) > { > -long a, d; > +long a, d, aa, dd; > u8 ex = 1; > > *mem = 0; a = 1; d = 2; > @@ -598,6 +598,19 @@ static void test_div(long *mem) >: "+a"(a), "+d"(d), "+q"(ex) : "m"(*mem)); > report("divq (1)", > a == 0x1ffb1b963b33ul && d == 0x273ba4384ede2ul && !ex); > +aa = 0x; dd = 0x; > +*mem = 0x; a = aa; d = dd; > +asm("mulb %2" : "+a"(a), "+d"(d) : "m"(*mem)); > +report("mulb mem", a == 0x0363 && d == dd); > +*mem = 0x; a = aa; d = dd; > +asm("mulw %2" : "+a"(a), "+d"(d) : "m"(*mem)); > +report("mulw mem", a == 0xc963 && d == 0x0369); > +*mem = 0x; a = aa; d = dd; > +asm("mull %2" : "+a"(a), "+d"(d) : "m"(*mem)); > +report("mull mem", a == 0x962fc963 && d == 0x369d036); > +*mem = 0x; a = aa; d = dd; > +asm("mulq %2" : "+a"(a), "+d"(d) : "m"(*mem)); > +report("mulq mem", a == 0x2fc962fc962fc963 && d == 0x369d0369d0369d0); > } > > typedef unsigned __attribute__((vector_size(16))) sse128; > @@ -934,7 +947,7 @@ int main() > test_btc(mem); > test_bsfbsr(mem); > test_imul(mem); > - test_div(mem); > + test_muldiv(mem); > test_sse(mem); > test_mmx(mem); > test_rip_relative(mem, insn_ram); > -- > 1.8.1.2 -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] KVM: x86 emulator: decode extended accumulator explicitly
On Sun, Feb 10, 2013 at 02:19:29PM +0200, Gleb Natapov wrote: > On Sat, Feb 09, 2013 at 11:31:45AM +0200, Avi Kivity wrote: > > Single-operand MUL and DIV access an extended accumulator: AX for byte > > instructions, and DX:AX, EDX:EAX, or RDX:RAX for larger-sized instructions. > > Add support for fetching the extended accumulator. > > > Where this "extended accumulator" definition is coming from? Spec just > says that two registers are used for a result depending on the operand > size. > > > In order not to change things too much, RDX is loaded into Src2, which is > > already loaded by fastop(). This avoids increasing register pressure on > > i386. > > > > Signed-off-by: Avi Kivity > > --- > > arch/x86/kvm/emulate.c | 22 ++ > > 1 file changed, 22 insertions(+) > > > > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c > > index 18c86b5..aa8516e 100644 > > --- a/arch/x86/kvm/emulate.c > > +++ b/arch/x86/kvm/emulate.c > > @@ -60,6 +60,8 @@ > > #define OpGS 25ull /* GS */ > > #define OpMem826ull /* 8-bit zero extended memory operand */ > > #define OpImm64 27ull /* Sign extended 16/32/64-bit immediate */ > > +#define OpAccLo 29ull /* Low part of extended acc > > (AX/AX/EAX/RAX) */ > > +#define OpAccHi 30ull /* High part of extended acc > > (-/DX/EDX/RDX) */ > > > > #define OpBits 5 /* Width of operand field */ > > #define OpMask ((1ull << OpBits) - 1) > > @@ -85,6 +87,7 @@ > > #define DstMem64(OpMem64 << DstShift) > > #define DstImmUByte (OpImmUByte << DstShift) > > #define DstDX (OpDX << DstShift) > > +#define DstAccLo(OpAccLo << DstShift) > > #define DstMask (OpMask << DstShift) > > /* Source operand type. */ > > #define SrcShift6 > > @@ -106,6 +109,7 @@ > > #define SrcImm64(OpImm64 << SrcShift) > > #define SrcDX (OpDX << SrcShift) > > #define SrcMem8 (OpMem8 << SrcShift) > > +#define SrcAccHi(OpAccHi << SrcShift) > > #define SrcMask (OpMask << SrcShift) > > #define BitOp (1<<11) > > #define MemAbs (1<<12) /* Memory operand is absolute > > displacement */ > > @@ -154,6 +158,8 @@ > > #define NoWrite ((u64)1 << 45) /* No writeback */ > > #define SrcWrite((u64)1 << 46) /* Write back src operand */ > > > > +#define DstXacc (DstAccLo | SrcAccHi | SrcWrite) > > + > > #define X2(x...) x, x > > #define X3(x...) X2(x), x > > #define X4(x...) X2(x), X2(x) > > @@ -4129,6 +4135,22 @@ static int decode_operand(struct x86_emulate_ctxt > > *ctxt, struct operand *op, > > fetch_register_operand(op); > > op->orig_val = op->val; > > break; > > + case OpAccLo: > > + op->type = OP_REG; > > + op->bytes = (ctxt->d & ByteOp) ? 2 : ctxt->op_bytes; > > + op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX); > > + fetch_register_operand(op); > > + op->orig_val = op->val; > > + break; > > + case OpAccHi: > > + if (ctxt->d & ByteOp) > > + break; > Why would we set OpAccHi if ByteOp is set in decode tables in the first > place? > Was trying to create decode tables so that this one will not be necessary, but it will require to have two group3 arrays, one for ByteOp and another for non ByteOp. Not nice. But what I noticed is that current code writes into NULL pointer when ByteOp mul/div is emulated. It happens because src operand stays undecoded, but since SrcWrite is set for the instruction src operand is written back anyway. Since it is undecoded op->type is zero and op->addr.reg is NULL and type zero happens to be OP_REG, so write_register_operand() is called with op->addr.reg == NULL. We should do op->type = OP_NONE if !ByteOp here. Will fix and apply, try to come up with a way to express it via decode tables on top. 
> > + op->type = OP_REG; > > + op->bytes = ctxt->op_bytes; > > + op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RDX); > > + fetch_register_operand(op); > > + op->orig_val = op->val; > > + break; > > case OpDI: > > op->type = OP_MEM; > > op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes; > > -- > > 1.8.1.2 > > -- > Gleb. > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
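The fix Gleb describes amounts to marking the high half as absent for byte-sized operations, so the SrcWrite writeback has nothing to write back; roughly (a sketch of the described fix, not the committed hunk):

    case OpAccHi:
        if (ctxt->d & ByteOp) {
            /* byte ops have no high half; OP_NONE avoids the
             * NULL op->addr.reg writeback described above */
            op->type = OP_NONE;
            break;
        }
        op->type = OP_REG;
        op->bytes = ctxt->op_bytes;
        op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RDX);
        fetch_register_operand(op);
        op->orig_val = op->val;
        break;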
Re: [Qemu-devel] VFIO VGA test branches
Hi Alex, Alex Williamson wrote: > On Sun, 2013-05-19 at 23:26 +0400, Maik Broemme wrote: > > Hi Knut, > > > > Knut Omang wrote: > > > > > > On Mon, 2013-05-13 at 16:23 -0600, Alex Williamson wrote: > > > > On Mon, 2013-05-13 at 22:55 +0200, Knut Omang wrote: > > > > > Hi all, > > > > > > > > > > Perfect timing from my perspective, thanks Alex! > > > > > > > > > > I spent the better part of the weekend testing your branches on a new > > > > > system > > > > > I just put together for this purpose, results below.. > > > > > > > > > > On Fri, 2013-05-03 at 16:56 -0600, Alex Williamson wrote: > > > > > ... > > > > > > git://github.com/awilliam/linux-vfio.git vfio-vga-reset > > > > > > git://github.com/awilliam/qemu-vfio.git vfio-vga-reset > > > > > > > > > > System setup: > > > > > > > > > > - Fedora 18 on > > > > > - Gigabyte Z77X-UD5H motherboard > > > > > - Intel Core i7 3770 (Ivy bridge w/integrated graphics) > > > > > - 2 discrete graphics cards: > > > > > > > > > > lspci | egrep 'VGA|Audio' > > > > > 00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 > > > > > v2/3rd Gen Core processor Graphics Controller (rev 09) > > > > > 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset > > > > > Family High Definition Audio Controller (rev 04) > > > > > 01:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee > > > > > ATI Caicos [Radeon HD 6450] > > > > > 01:00.1 Audio device: Advanced Micro Devices [AMD] nee ATI Caicos > > > > > HDMI Audio [Radeon HD 6400 Series] > > > > > 02:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee > > > > > ATI Cape Verde PRO [Radeon HD 7700 Series] > > > > > 02:00.1 Audio device: Advanced Micro Devices [AMD] nee ATI Cape > > > > > Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series] > > > > > > > > > > Short summary: > > > > > > > > > > - Once I got past a few time consuming obstacles explained below > > > > >- the graphics part of the graphics/hdmi audio passthrough seems > > > > > to work perfect > > > > > on both discrete graphics cards > > > > > (though so far only one at at time and with some minor issues, > > > > > see below) > > > > >- no success with the hdmi audio yet (ideas for further > > > > > investigation appreciated!) > > > > > > > > I've had hdmi audio working with an HD7850, but only in Windows (7) and > > > > it was using legacy interrupts for some reason instead of MSI. I wonder > > > > if Liux guests might work with snd_hda_intel.enable_msi=0. I'm not sure > > > > what's wrong with MSI, but it seems to be new with the PCI bus reset > > > > support. > > > > > > In my first tries, Windows were just using a generic > > > VGA driver, which still seems to work perfect with reboots and everything > > > and in full screen resolution (1920x1200). > > > However after installing the Catalyst AMD driver stack, upon boot > > > Windows 7 now consequently get a BSOD from the graphics driver > > > with the message: > > > > > > "Attempt to reset the display driver and recover from timeout failed" > > > - a picture of the BSOD screen attached. > > > > > > I attach the corresponding vfio log where I added some timing code to > > > make it easier to see when the BSOD happens (with 2 seconds of silence > > > in the log before the VM reboots, I believe this is at 09:28:32-34 in > > > the log. > > > > > > Similar behaviour both just after reboot/power cycle of the host and > > > subsequent VM boot attempts. 
> > > > > > This is still with the HD7700 as passed through device, but after a > > > motherboard firmware upgrade (to F14) which did not seem to affect the > > > observed behaviour on Windows prior to Catalyst install or with Linux > > > guest, neither did it fix the bug in selecting primary devices as I > > > was hoping for. > > > > > > Let me know if you have ideas for further debugging this, > > > > > > > I had a similar problem a couple of days ago and posted it in this list. > > I got similar BSOD and tested already the following configurations: > > > > 1) machine: q35 / kvm: yes / vga: none / x-vga: on = qemu freeze > > 2) machine: q35 / kvm: no / vga: none / x-vga: on = qemu freeze > >(with errors below) > > 3) machine: q35 / kvm: yes / vga: none / x-vga: off = qemu runs with > >100% CPU due to no VGA init (no picture) > > 4) machine: q35 / kvm: yes / vga: cirrus / x-vga: off = qemu runs with > >BSOD on loading atikmpag.sys > > 5) machine: pc / kvm: yes / vga: cirrus / x-vga: off = qemu runs fine > > > > However I've re-run the BSOD case already with the following branches > > from Alex: > > > > git://github.com/awilliam/linux-vfio.git vfio-vga-reset > > git://github.com/awilliam/qemu-vfio.git vfio-vga-reset > > > > Also with latest seabios and it worked so far. No more BSOD and reboot > > of VM was also possible without suspending/resuming the host in between. > > Good to hear. It looks like you have the same motherboard as my AMD > test s
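For reference, configuration (1) above corresponds to an invocation along these lines (illustrative only; the PCI addresses match the HD7700 from Knut's lspci output and will differ per system):

    qemu-system-x86_64 -M q35 -enable-kvm -m 4096 -vga none \
        -device vfio-pci,host=02:00.0,x-vga=on \
        -device vfio-pci,host=02:00.1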
KVM call agenda for 2013-05-21
Hi Please, send any topic that you are interested in covering. Thanks, Juan. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, 2013-05-20 at 13:27 +0300, Gleb Natapov wrote: > On Mon, May 20, 2013 at 08:25:11PM +1000, Vadim Rozenfeld wrote: > > On Mon, 2013-05-20 at 12:25 +0300, Gleb Natapov wrote: > > > On Mon, May 20, 2013 at 10:56:22AM +0200, Paolo Bonzini wrote: > > > > In any case, this patch needs more comments and a better commit message. > > > > Microsoft docs are decent, but there are several non-obvious points in > > > > how the patches were done, and they need to be documented. > > > I wish you were right about Microsoft docs :) So in Hyper-V spec they > > > say: > > > > > > Special value of 0x is used to indicate that this facility is no > > > longer a reliable source of reference time and the virtual machine must > > > fall back to a different source (for example, the virtual PM timer). > > > > > > May be they really mean "virtual PM timer" here and reference counter is > > > not considered as a fall back source, but this is not what we want. > > > > As far as I know, you cannot fall back from iTSC to PMTimer or HPET, > > but you can fallback to reference counters. > > > What if you put 0x as a sequence? Or is this another case where > the spec is wrong. > it will use PMTimer (maybe HPET if you have it) if you specify it on VM's start up. But I'm not sure if it will work if you migrate from TSC or reference counter to 0x > > > > > > On the other hand in API specification [1] they have: > > > > > > #define HV_REFERENCE_TSC_SEQUENCE_INVALID (0x) > > > > > > which is not even documented in hyper-v spec. Actually 0 is specified as > > > valid value there. Go figure. > > > > > > [1] > > > http://msdn.microsoft.com/en-us/library/windows/hardware/ff540244%28v=vs.85%29.aspx > > > > > > -- > > > Gleb. > > > > -- > Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, May 20, 2013 at 08:25:11PM +1000, Vadim Rozenfeld wrote: > On Mon, 2013-05-20 at 12:25 +0300, Gleb Natapov wrote: > > On Mon, May 20, 2013 at 10:56:22AM +0200, Paolo Bonzini wrote: > > > In any case, this patch needs more comments and a better commit message. > > > Microsoft docs are decent, but there are several non-obvious points in > > > how the patches were done, and they need to be documented. > > I wish you were right about Microsoft docs :) So in Hyper-V spec they > > say: > > > > Special value of 0x is used to indicate that this facility is no > > longer a reliable source of reference time and the virtual machine must > > fall back to a different source (for example, the virtual PM timer). > > > > May be they really mean "virtual PM timer" here and reference counter is > > not considered as a fall back source, but this is not what we want. > > As far as I know, you cannot fall back from iTSC to PMTimer or HPET, > but you can fallback to reference counters. > What if you put 0x as a sequence? Or is this another case where the spec is wrong. > > > > On the other hand in API specification [1] they have: > > > > #define HV_REFERENCE_TSC_SEQUENCE_INVALID (0x) > > > > which is not even documented in hyper-v spec. Actually 0 is specified as > > valid value there. Go figure. > > > > [1] > > http://msdn.microsoft.com/en-us/library/windows/hardware/ff540244%28v=vs.85%29.aspx > > > > -- > > Gleb. > -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, 2013-05-20 at 12:25 +0300, Gleb Natapov wrote: > On Mon, May 20, 2013 at 10:56:22AM +0200, Paolo Bonzini wrote: > > In any case, this patch needs more comments and a better commit message. > > Microsoft docs are decent, but there are several non-obvious points in > > how the patches were done, and they need to be documented. > I wish you were right about Microsoft docs :) So in Hyper-V spec they > say: > > Special value of 0x is used to indicate that this facility is no > longer a reliable source of reference time and the virtual machine must > fall back to a different source (for example, the virtual PM timer). > > May be they really mean "virtual PM timer" here and reference counter is > not considered as a fall back source, but this is not what we want. As far as I know, you cannot fall back from iTSC to PMTimer or HPET, but you can fallback to reference counters. > > On the other hand in API specification [1] they have: > > #define HV_REFERENCE_TSC_SEQUENCE_INVALID (0x) > > which is not even documented in hyper-v spec. Actually 0 is specified as > valid value there. Go figure. > > [1] > http://msdn.microsoft.com/en-us/library/windows/hardware/ff540244%28v=vs.85%29.aspx > > -- > Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, 2013-05-20 at 10:56 +0200, Paolo Bonzini wrote: > Il 20/05/2013 10:49, Gleb Natapov ha scritto: > > On Mon, May 20, 2013 at 10:42:52AM +0200, Paolo Bonzini wrote: > >> Il 20/05/2013 10:36, Gleb Natapov ha scritto: > >>> On Mon, May 20, 2013 at 10:05:38AM +0200, Paolo Bonzini wrote: > Il 19/05/2013 08:37, Vadim Rozenfeld ha scritto: > > On Thu, 2013-05-16 at 16:45 +0200, Paolo Bonzini wrote: > >> Il 16/05/2013 16:26, Vadim Rozenfeld ha scritto: > >>> > >>> Yes, I have this check added in the second patch. > >>> > > Move it here please. > >>> OK, will do it. > > > > Or better, remove all the handling of HV_X64_MSR_REFERENCE_TSC from > > this > > patch, and leave it all to the second. > > > >>> What for? Could you please elaborate? > > > > To make code reviewable. Add one MSR here, the other in the second > > patch. > > removing HV_X64_MSR_REFERENCE_TSC will make this particular patch > > completely non-functional. > > Do you mean Windows guest will BSOD or just that they won't use the > reference TSC? If the latter, it's not a problem. > > >>> I think it is. If reference counter works without TSC we have a bisect > >>> point for the case when something will going wrong with TSC. > >> > >> Isn't that exactly what might happen with this patch only? Windows will > >> not use the TSC because it finds invalid values in the TSC page. > > > > Yes, it will use reference counter instead. Exactly what we want for a > > bisect point. > > > >> If it > >> still uses the reference counter, we have the situation you describe. > >> > >> refcountTSC page > >> Y Y <= after patch 2 > >> Y N <= after patch 1 > >> N Y <= impossible > >> N N <= removing TSC page from this patch? > >> > >> Of course if the guest BSODs, it's not possible to split the patches > >> that way. Perhaps in that case it's simply better to do a single patch. > >> > > I am not sure what you are trying to say. Your option list above shows > > that there is a value to split patches like they are split now. > > Hmm, we're talking past each other. :) > > I put the "?" because that's what Vadim implied ("it would make this > particular patch non-functional"), but I don't see why it should be like > this. To me, the obvious way of getting the desired bisect point is > implementing one MSR per patch. So, moving the REFERENCE_TSC handling > entirely to patch 2 would still be in the "refcount=Y, TSC page=N" case. > > In any case, this patch needs more comments and a better commit message. > Microsoft docs are decent, but there are several non-obvious points in > how the patches were done, and they need to be documented. We need to specify two partition privileges to activate the reference time enlightenment in HYPERV_CPUID_FEATURES (0x4003): AccessPartitionReferenceCounter and AccessPartitionReferenceTsc; otherwise the VM will use HPET or PMTimer as a timestamp source. If we specify AccessPartitionReferenceTsc but don't handle the write request to HV_X64_MSR_REFERENCE_TSC, the system will fail with a 0x78 (PHASE0_EXCEPTION) bugcheck code. If we provide a HV_X64_MSR_REFERENCE_TSC handler but don't initialize the sequence to 0, the guest will probably never start or will be extremely slow, because in that case the scale should also be initialized. Sequence 0 is a special case: it means use the reference counter, not the TSC, as the time source. It is also the fallback solution for the case when a VM using the TSC has been migrated to a host that is not equipped with an invariant TSC.
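For background, the guest reads the reference TSC page with a seqlock-style loop, which is where the sequence semantics above come into play; a sketch of the algorithm (struct layout per the documents discussed above, helper names illustrative -- rdtsc()/rdmsr() stand in for the usual TSC/MSR accessors):

    struct hv_ref_tsc_page {
        u32 tsc_sequence;
        u32 reserved;
        u64 tsc_scale;
        s64 tsc_offset;
    };

    static u64 read_hv_reference_time(volatile struct hv_ref_tsc_page *pg)
    {
        u32 seq;
        u64 tsc, scale;
        s64 offset;

        do {
            seq = pg->tsc_sequence;
            if (seq == 0)   /* "use the reference counter" (or the invalid
                             * marker -- see the dispute above) */
                return rdmsr(HV_X64_MSR_TIME_REF_COUNT);
            tsc = rdtsc();
            scale = pg->tsc_scale;
            offset = pg->tsc_offset;
        } while (pg->tsc_sequence != seq);

        /* 100ns units: high 64 bits of a 64x64 multiply, plus offset */
        return (u64)(((unsigned __int128)tsc * scale) >> 64) + offset;
    }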
Re: [RFC PATCH 1/2] Hyper-V reference counter
Hi all, sorry that I am a bit unresponsive about this series. I have a few days off and can't spend much time in this. If I read that the REFERENCE TSC breaks migration I don't think its a good option to include it at all. I have this hyperv_refcnt MSR in an internal patch I sent over about 1.5 years ago and its working flawlessly with Win2k8R2, Win7, Win8 + Win2012. I set the reference TSC to 0x00 and this seems to work with all the above Windows versions. Some of the early Alphas of Windows 8 didn't work with this patch, but the final is running smoothly also with migration etc. I crafted this patch to avoid the heavy calls to PM Timer during high I/O which slowed down Windows approx. by 30% compared to Hyper-V. I reinclude this patch for reference. Its unchanged since mid 2012 and it might not apply. Cheers, Peter diff -Npur kvm-kmod-3.3/include/asm-x86/hyperv.h kvm-kmod-3.3-hyperv-refcnt/include/asm-x86/hyperv.h --- kvm-kmod-3.3/include/asm-x86/hyperv.h 2012-03-19 23:00:49.0 +0100 +++ kvm-kmod-3.3-hyperv-refcnt/include/asm-x86/hyperv.h 2012-03-28 12:23:02.0 +0200 @@ -169,7 +169,8 @@ /* MSR used to read the per-partition time reference counter */ #define HV_X64_MSR_TIME_REF_COUNT 0x4020 - +#define HV_X64_MSR_REFERENCE_TSC 0x4021 + /* Define the virtual APIC registers */ #define HV_X64_MSR_EOI 0x4070 #define HV_X64_MSR_ICR 0x4071 diff -Npur kvm-kmod-3.3/include/asm-x86/kvm_host.h kvm-kmod-3.3-hyperv-refcnt/include/asm-x86/kvm_host.h --- kvm-kmod-3.3/include/asm-x86/kvm_host.h 2012-03-19 23:00:49.0 +0100 +++ kvm-kmod-3.3-hyperv-refcnt/include/asm-x86/kvm_host.h 2012-03-28 15:08:24.0 +0200 @@ -553,6 +553,8 @@ struct kvm_arch { /* fields used by HYPER-V emulation */ u64 hv_guest_os_id; u64 hv_hypercall; + u64 hv_ref_count; + u64 hv_reference_tsc; atomic_t reader_counter; diff -Npur kvm-kmod-3.3/x86/x86.c kvm-kmod-3.3-hyperv-refcnt/x86/x86.c --- kvm-kmod-3.3/x86/x86.c 2012-03-19 23:00:56.0 +0100 +++ kvm-kmod-3.3-hyperv-refcnt/x86/x86.c2012-03-28 16:27:46.0 +0200 @@ -826,7 +826,7 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc); static u32 msrs_to_save[] = { MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, - HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, + HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, MSR_STAR, @@ -1387,6 +1387,8 @@ static bool kvm_hv_msr_partition_wide(u3 switch (msr) { case HV_X64_MSR_GUEST_OS_ID: case HV_X64_MSR_HYPERCALL: + case HV_X64_MSR_REFERENCE_TSC: + case HV_X64_MSR_TIME_REF_COUNT: r = true; break; } @@ -1426,6 +1428,21 @@ static int set_msr_hyperv_pw(struct kvm_ if (__copy_to_user((void *)addr, instructions, 4)) return 1; kvm->arch.hv_hypercall = data; + kvm->arch.hv_ref_count = get_kernel_ns(); + break; + } + case HV_X64_MSR_REFERENCE_TSC: { + u64 gfn; + unsigned long addr; + u32 hv_tsc_sequence; + gfn = data >> HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT; + addr = gfn_to_hva(kvm, gfn); + if (kvm_is_error_hva(addr)) + return 1; + hv_tsc_sequence = 0x0; //invalid + if (__copy_to_user((void *)addr, (void __user *) &hv_tsc_sequence, sizeof(hv_tsc_sequence))) + return 1; + kvm->arch.hv_reference_tsc = data; break; } default: @@ -1826,6 +1843,17 @@ static int get_msr_hyperv_pw(struct kvm_ case HV_X64_MSR_HYPERCALL: data = kvm->arch.hv_hypercall; break; + case HV_X64_MSR_TIME_REF_COUNT: { + u64 now_ns; + local_irq_disable(); + now_ns = get_kernel_ns(); + data = div_u64(now_ns + 
kvm->arch.kvmclock_offset - kvm->arch.hv_ref_count,100); + local_irq_enable(); + break; + } + case HV_X64_MSR_REFERENCE_TSC: + data = kvm->arch.hv_reference_tsc; + break; default: pr_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr); return 1; Am 20.05.2013 um 11:41 schrieb Gleb Natapov : > On Mon, May 20, 2013 at 11:32:27AM +0200, Paolo Bonzini wrote: >> Il 20/05/2013 11:25, Gleb Natapov ha scritto: >>> So in Hyper-V spec they >>> say: >>> >>> Special value of 0x is used to indicate that this facili
Re: [PATCH] virtio-net: Reporting traffic queue distribution statistics through ethtool
On Sun, May 19, 2013 at 10:56:16PM +, Narasimhan, Sriram wrote: > Hi Michael, > > Comments inline... > > -Original Message- > From: Michael S. Tsirkin [mailto:m...@redhat.com] > Sent: Sunday, May 19, 2013 1:03 PM > To: Narasimhan, Sriram > Cc: ru...@rustcorp.com.au; virtualizat...@lists.linux-foundation.org; > kvm@vger.kernel.org; net...@vger.kernel.org; linux-ker...@vger.kernel.org; > Jason Wang > Subject: Re: [PATCH] virtio-net: Reporting traffic queue distribution > statistics through ethtool > > On Sun, May 19, 2013 at 04:09:48PM +, Narasimhan, Sriram wrote: > > Hi Michael, > > > > I was getting all packets on the same inbound queue which > > is why I added this support to virtio-net (and some more > > instrumentation at tun as well). But, it turned out to be > > my misconfiguration - I did not enable IFF_MULTI_QUEUE on > > the tap device, so the real_num_tx_queues on tap netdev was > > always 1 (no tx distribution at tap). > > Interesting that qemu didn't fail. > > [Sriram] void tun_set_real_num_tx_queues() does not return > the EINVAL return from netif_set_real_num_tx_queues() for > txq > dev->num_tx_queues (which would be the case if the > tap device were not created with IFF_MULTI_QUEUE). I think > it would be better to fix the code to disable the new > queue and fail tun_attach() You mean fail TUNSETQUEUE? > in this scenario. If you > agree, I can go ahead and create a separate patch for that. Hmm I not sure I understand what happens, so hard for me to tell. I think this code was supposed to handle it: err = -EBUSY; if (!(tun->flags & TUN_TAP_MQ) && tun->numqueues == 1) goto out; why doesn't it? > > > I am thinking about > > adding a -q option to tunctl to specify multi-queue flag on > > the tap device. > > Absolutely. > > [Sriram] OK, let me do that. > > > Yes, number of exits will be most useful. I will look into > > adding the other statistics you mention. > > > > Sriram > > Pls note you'll need to switch to virtqueue_kick_prepare > to detect exits: virtqueue_kick doesn't let you know > whether there was an exit. > > Also, it's best to make this a separate patch from the one > adding per-queue stats. > > [Sriram] OK, I will cover only the per-queue statistics in > this patch. Also, I will address the indentation/data > structure name points that you mentioned in your earlier > email and send a new revision for this patch. > > Sriram > > > -Original Message- > > From: Michael S. Tsirkin [mailto:m...@redhat.com] > > Sent: Sunday, May 19, 2013 4:28 AM > > To: Narasimhan, Sriram > > Cc: ru...@rustcorp.com.au; virtualizat...@lists.linux-foundation.org; > > kvm@vger.kernel.org; net...@vger.kernel.org; linux-ker...@vger.kernel.org > > Subject: Re: [PATCH] virtio-net: Reporting traffic queue distribution > > statistics through ethtool > > > > On Thu, May 16, 2013 at 01:24:29PM -0700, Sriram Narasimhan wrote: > > > This patch allows virtio-net driver to report traffic distribution > > > to inbound/outbound queues through ethtool -S. The per_cpu > > > virtnet_stats is split into receive and transmit stats and are > > > maintained on a per receive_queue and send_queue basis. > > > virtnet_stats() is modified to aggregate interface level statistics > > > from per-queue statistics. Sample output below: > > > > > > > Thanks for the patch. The idea itself looks OK to me. > > Ben Hutchings already sent some comments > > so I won't repeat them. Some minor more comments > > and questions below. 
> > > > > NIC statistics: > > > rxq0: rx_packets: 4357802 > > > rxq0: rx_bytes: 292642052 > > > txq0: tx_packets: 824540 > > > txq0: tx_bytes: 55256404 > > > rxq1: rx_packets: 0 > > > rxq1: rx_bytes: 0 > > > txq1: tx_packets: 1094268 > > > txq1: tx_bytes: 73328316 > > > rxq2: rx_packets: 0 > > > rxq2: rx_bytes: 0 > > > txq2: tx_packets: 1091466 > > > txq2: tx_bytes: 73140566 > > > rxq3: rx_packets: 0 > > > rxq3: rx_bytes: 0 > > > txq3: tx_packets: 1093043 > > > txq3: tx_bytes: 73246142 > > > > Interesting. This example implies that all packets are coming in > > through the same RX queue - is this right? > > If yes that's worth exploring - could be a tun bug - > > and shows how this patch is useful. > > > > > Signed-off-by: Sriram Narasimhan > > > > BTW, while you are looking at the stats, one other interesting > > thing to add could be checking more types of stats: number of exits, > > queue full errors, etc. > > > > > --- > > > drivers/net/virtio_net.c | 157 > > > +- > > > 1 files changed, 128 insertions(+), 29 deletions(-) > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > > index 3c23fdc..3c58c52 100644 > > > --- a/drivers/net/virtio_net.c > > > +++ b/drivers/net/virtio_net.c > > > @@ -41,15 +41,46 @@ module_param(gso, bool, 0444); > > > > > > #define
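As an aside, until tunctl grows such a -q option, a multi-queue tap can also be created with iproute2 (assuming an ip recent enough to know the flag):

    ip tuntap add dev tap1 mode tap multi_queue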
Re: [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page
On Mon, May 20, 2013 at 05:19:26PM +0800, Xiao Guangrong wrote: > On 05/19/2013 06:47 PM, Gleb Natapov wrote: > > On Fri, May 17, 2013 at 05:12:57AM +0800, Xiao Guangrong wrote: > >> Move deletion shadow page from the hash list from kvm_mmu_commit_zap_page > >> to > >> kvm_mmu_prepare_zap_page so that we can call kvm_mmu_commit_zap_page > >> once for multiple kvm_mmu_prepare_zap_page that can help us to avoid > >> unnecessary TLB flush > >> > > Don't we call kvm_mmu_commit_zap_page() once for multiple > > kvm_mmu_prepare_zap_page() now when possible? kvm_mmu_commit_zap_page() > > gets a list as a parameter. I am not against the change, but wish to > > understand it better. > > The changelong is not clear enough, i mean we can "call > kvm_mmu_commit_zap_page once for multiple kvm_mmu_prepare_zap_page" when > we use lock-break technique. If we do not do this, the page can be found > in hashtable but they are linked on the invalid_list on other thread. > Got it. Make sense. > > > >> Signed-off-by: Xiao Guangrong > >> --- > >> arch/x86/kvm/mmu.c |8 ++-- > >> 1 files changed, 6 insertions(+), 2 deletions(-) > >> > >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > >> index 40d7b2d..682ecb4 100644 > >> --- a/arch/x86/kvm/mmu.c > >> +++ b/arch/x86/kvm/mmu.c > >> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm > >> *kvm, int nr) > >> static void kvm_mmu_free_page(struct kvm_mmu_page *sp) > >> { > >>ASSERT(is_empty_shadow_page(sp->spt)); > >> - hlist_del(&sp->hash_link); > >> + > >>list_del(&sp->link); > >>free_page((unsigned long)sp->spt); > >>if (!sp->role.direct) > >> @@ -1655,7 +1655,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, > >> > >> #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) > >> \ > >>for_each_gfn_sp(_kvm, _sp, _gfn)\ > >> - if ((_sp)->role.direct || (_sp)->role.invalid) {} else > >> + if ((_sp)->role.direct || \ > >> +((_sp)->role.invalid && WARN_ON(1))) {} else > >> > >> /* @sp->gfn should be write-protected at the call site */ > >> static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, > >> @@ -2074,6 +2075,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, > >> struct kvm_mmu_page *sp, > >>unaccount_shadowed(kvm, sp->gfn); > >>if (sp->unsync) > >>kvm_unlink_unsync_page(kvm, sp); > >> + > >> + hlist_del_init(&sp->hash_link); > >> + > > What about moving this inside if() bellow and making it hlist_del()? > > Leave the page on the hash if root_count is non zero. > > > > It's a good idea. will update. > -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
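Concretely, the follow-up amounts to unhashing only pages that are really going away, so rooted pages stay findable; roughly (a sketch abbreviated from kvm_mmu_prepare_zap_page(), not the committed hunk):

    if (!sp->root_count) {
        /* really being zapped: only now is it safe to unhash */
        hlist_del(&sp->hash_link);
        list_move(&sp->link, invalid_list);
        kvm_mod_used_mmu_pages(kvm, -1);
    } else {
        /* still in use as a root: keep it hashed and active */
        list_move(&sp->link, &kvm->arch.active_mmu_pages);
        kvm_reload_remote_mmus(kvm);
    }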
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, May 20, 2013 at 11:32:27AM +0200, Paolo Bonzini wrote: > Il 20/05/2013 11:25, Gleb Natapov ha scritto: > > So in Hyper-V spec they > > say: > > > > Special value of 0x is used to indicate that this facility is no > > longer a reliable source of reference time and the virtual machine must > > fall back to a different source (for example, the virtual PM timer). > > > > May be they really mean "virtual PM timer" here and reference counter is > > not considered as a fall back source, but this is not what we want. > > > > On the other hand in API specification [1] they have: > > > > #define HV_REFERENCE_TSC_SEQUENCE_INVALID (0x) > > > > which is not even documented in hyper-v spec. Actually 0 is specified as > > valid value there. Go figure. > > > > [1] > > http://msdn.microsoft.com/en-us/library/windows/hardware/ff540244%28v=vs.85%29.aspx > > Ok, if the API document is right then we should use > HV_REFERENCE_TSC_SEQUENCE_INVALID instead of 0, with a comment > explaining why we use 0 and not 0x. > Using define is always a good idea no matter if API doc is right or hyper-v spec is, it's just the "decent documentation" part that got me :) Definitely better than nothing and thanks them for that. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/2] Hyper-V reference counter
Il 20/05/2013 11:25, Gleb Natapov ha scritto: > So in Hyper-V spec they > say: > > Special value of 0x is used to indicate that this facility is no > longer a reliable source of reference time and the virtual machine must > fall back to a different source (for example, the virtual PM timer). > > May be they really mean "virtual PM timer" here and reference counter is > not considered as a fall back source, but this is not what we want. > > On the other hand in API specification [1] they have: > > #define HV_REFERENCE_TSC_SEQUENCE_INVALID (0x) > > which is not even documented in hyper-v spec. Actually 0 is specified as > valid value there. Go figure. > > [1] > http://msdn.microsoft.com/en-us/library/windows/hardware/ff540244%28v=vs.85%29.aspx Ok, if the API document is right then we should use HV_REFERENCE_TSC_SEQUENCE_INVALID instead of 0, with a comment explaining why we use 0 and not 0x. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, May 20, 2013 at 10:56:22AM +0200, Paolo Bonzini wrote: > In any case, this patch needs more comments and a better commit message. > Microsoft docs are decent, but there are several non-obvious points in > how the patches were done, and they need to be documented. I wish you were right about Microsoft docs :) So in the Hyper-V spec they say: Special value of 0xFFFFFFFF is used to indicate that this facility is no longer a reliable source of reference time and the virtual machine must fall back to a different source (for example, the virtual PM timer). Maybe they really mean "virtual PM timer" here and the reference counter is not considered as a fallback source, but this is not what we want. On the other hand, in the API specification [1] they have: #define HV_REFERENCE_TSC_SEQUENCE_INVALID (0x00000000) which is not even documented in the hyper-v spec. Actually 0 is specified as a valid value there. Go figure. [1] http://msdn.microsoft.com/en-us/library/windows/hardware/ff540244%28v=vs.85%29.aspx -- Gleb.
Re: [PATCH v6 2/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page
On 05/19/2013 06:47 PM, Gleb Natapov wrote: > On Fri, May 17, 2013 at 05:12:57AM +0800, Xiao Guangrong wrote: >> Move deletion of the shadow page from the hash list from kvm_mmu_commit_zap_page to >> kvm_mmu_prepare_zap_page so that we can call kvm_mmu_commit_zap_page >> once for multiple kvm_mmu_prepare_zap_page calls, which helps us avoid >> unnecessary TLB flushes >> > Don't we call kvm_mmu_commit_zap_page() once for multiple > kvm_mmu_prepare_zap_page() now when possible? kvm_mmu_commit_zap_page() > gets a list as a parameter. I am not against the change, but wish to > understand it better. The changelog is not clear enough; I mean we can "call kvm_mmu_commit_zap_page once for multiple kvm_mmu_prepare_zap_page" when we use the lock-break technique. If we do not do this, the page can be found in the hashtable while it is linked on another thread's invalid_list. > >> Signed-off-by: Xiao Guangrong >> --- >> arch/x86/kvm/mmu.c |8 ++-- >> 1 files changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c >> index 40d7b2d..682ecb4 100644 >> --- a/arch/x86/kvm/mmu.c >> +++ b/arch/x86/kvm/mmu.c >> @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm >> *kvm, int nr) >> static void kvm_mmu_free_page(struct kvm_mmu_page *sp) >> { >> ASSERT(is_empty_shadow_page(sp->spt)); >> -hlist_del(&sp->hash_link); >> + >> list_del(&sp->link); >> free_page((unsigned long)sp->spt); >> if (!sp->role.direct) >> @@ -1655,7 +1655,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, >> >> #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) >> \ >> for_each_gfn_sp(_kvm, _sp, _gfn)\ >> -if ((_sp)->role.direct || (_sp)->role.invalid) {} else >> +if ((_sp)->role.direct || \ >> + ((_sp)->role.invalid && WARN_ON(1))) {} else >> >> /* @sp->gfn should be write-protected at the call site */ >> static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, >> @@ -2074,6 +2075,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, >> struct kvm_mmu_page *sp, >> unaccount_shadowed(kvm, sp->gfn); >> if (sp->unsync) >> kvm_unlink_unsync_page(kvm, sp); >> + >> +hlist_del_init(&sp->hash_link); >> + > What about moving this inside the if() below and making it hlist_del()? > Leave the page on the hash if root_count is non-zero. > It's a good idea. Will update.
Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages
On 05/19/2013 06:04 PM, Gleb Natapov wrote: >> +/* >> + * Do not repeatedly zap a root page to avoid unnecessary >> + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to >> + * progress: >> + *    vcpu 0    vcpu 1 >> + * call vcpu_enter_guest(): >> + *1): handle KVM_REQ_MMU_RELOAD >> + *and require mmu-lock to >> + *load mmu >> + * repeat: >> + *1): zap root page and >> + *send KVM_REQ_MMU_RELOAD >> + * >> + *2): if (cond_resched_lock(mmu-lock)) >> + * >> + *2): hold mmu-lock and load mmu >> + * >> + *3): see KVM_REQ_MMU_RELOAD bit >> + *on vcpu->requests is set >> + *then return 1 to call >> + *vcpu_enter_guest() again. >> + *goto repeat; >> + * >> + */ > I am not sure why the above scenario will prevent us from progressing. > There is a finite number of root pages with invalid generation number, so > eventually we will zap them all and vcpu1 will stop seeing the KVM_REQ_MMU_RELOAD > request. This patch does not "zap pages in batch", so kvm_zap_obsolete_pages() can just zap invalid root pages and lock-break due to the lock contention on the path of handling KVM_REQ_MMU_RELOAD. Yes, after "zap pages in batch", this issue does not exist any more. I should fold this into that patch. > > This check here prevents unnecessary KVM_REQ_MMU_RELOAD as you say, but > this raises the question, why don't we check for sp->role.invalid in > kvm_mmu_prepare_zap_page before calling kvm_reload_remote_mmus()? > Something like this: > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 40d7b2d..d2ae3a4 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -2081,7 +2081,8 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, > struct kvm_mmu_page *sp, > kvm_mod_used_mmu_pages(kvm, -1); > } else { > list_move(&sp->link, &kvm->arch.active_mmu_pages); > - kvm_reload_remote_mmus(kvm); > + if (!sp->role.invalid) > + kvm_reload_remote_mmus(kvm); > } > > sp->role.invalid = 1; Yes, it is better. > > Actually we can add a check for is_obsolete_sp() there too since > kvm_mmu_invalidate_all_pages() already calls kvm_reload_remote_mmus() > after incrementing mmu_valid_gen. Yes, I agree. > > Or do I miss something? No, you are right. ;)
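A sketch combining the two checks discussed above; the !sp->role.invalid test comes from Gleb's diff, while the is_obsolete_sp() test is the extra check he suggests only in words, so its use here is an assumption:

	} else {
		list_move(&sp->link, &kvm->arch.active_mmu_pages);
		/*
		 * An invalid or obsolete page has already caused a
		 * KVM_REQ_MMU_RELOAD broadcast once, so do not send
		 * the request again for the same page.
		 */
		if (!sp->role.invalid && !is_obsolete_sp(kvm, sp))
			kvm_reload_remote_mmus(kvm);
	}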
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, May 20, 2013 at 10:56:22AM +0200, Paolo Bonzini wrote: > Il 20/05/2013 10:49, Gleb Natapov ha scritto: > > On Mon, May 20, 2013 at 10:42:52AM +0200, Paolo Bonzini wrote: > >> Il 20/05/2013 10:36, Gleb Natapov ha scritto: > >>> On Mon, May 20, 2013 at 10:05:38AM +0200, Paolo Bonzini wrote: > Il 19/05/2013 08:37, Vadim Rozenfeld ha scritto: > > On Thu, 2013-05-16 at 16:45 +0200, Paolo Bonzini wrote: > >> Il 16/05/2013 16:26, Vadim Rozenfeld ha scritto: > >>> > >>> Yes, I have this check added in the second patch. > >>> > > Move it here please. > >>> OK, will do it. > > > > Or better, remove all the handling of HV_X64_MSR_REFERENCE_TSC from > > this > > patch, and leave it all to the second. > > > >>> What for? Could you please elaborate? > > > > To make code reviewable. Add one MSR here, the other in the second > > patch. > > removing HV_X64_MSR_REFERENCE_TSC will make this particular patch > > completely non-functional. > > Do you mean Windows guest will BSOD or just that they won't use the > reference TSC? If the latter, it's not a problem. > > >>> I think it is. If the reference counter works without the TSC we have a bisect > >>> point for the case when something goes wrong with the TSC. > >> > >> Isn't that exactly what might happen with this patch only? Windows will > >> not use the TSC because it finds invalid values in the TSC page. > > > > Yes, it will use the reference counter instead. Exactly what we want for a > > bisect point. > > > >> If it > >> still uses the reference counter, we have the situation you describe. > >>
> >> refcount  TSC page
> >> Y         Y  <= after patch 2
> >> Y         N  <= after patch 1
> >> N         Y  <= impossible
> >> N         N  <= removing TSC page from this patch?
> >> Of course if the guest BSODs, it's not possible to split the patches > >> that way. Perhaps in that case it's simply better to do a single patch. > >> > > I am not sure what you are trying to say. Your option list above shows > > that there is value in splitting the patches like they are split now. > > Hmm, we're talking past each other. :) > > I put the "?" because that's what Vadim implied ("it would make this > particular patch non-functional"), but I don't see why it should be like > this. To me, the obvious way of getting the desired bisect point is > implementing one MSR per patch. So, moving the REFERENCE_TSC handling > entirely to patch 2 would still be in the "refcount=Y, TSC page=N" case. > This is not the case as far as I understand from Vadim. Actually, from past discussion I remember win8 had some problem with an incomplete implementation, but that was likely pre-release versions. > In any case, this patch needs more comments and a better commit message. > Microsoft docs are decent, but there are several non-obvious points in > how the patches were done, and they need to be documented. -- Gleb.
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, 2013-05-20 at 10:05 +0200, Paolo Bonzini wrote: > Il 19/05/2013 08:37, Vadim Rozenfeld ha scritto: > > On Thu, 2013-05-16 at 16:45 +0200, Paolo Bonzini wrote: > >> Il 16/05/2013 16:26, Vadim Rozenfeld ha scritto: > >>> > >>> Yes, I have this check added in the second patch. > >>> > > Move it here please. > >>> OK, will do it. > > > > Or better, remove all the handling of HV_X64_MSR_REFERENCE_TSC from this > > patch, and leave it all to the second. > > > >>> What for? Could you please elaborate? > > > > To make code reviewable. Add one MSR here, the other in the second patch. > > removing HV_X64_MSR_REFERENCE_TSC will make this particular patch > > completely non-functional. > > Do you mean Windows guest will BSOD or just that they won't use the > reference TSC? If the latter, it's not a problem. Unfortunately, it will crash. > > Paolo
Re: [PATCH] virtio_pci: fix macro exported in uapi
Rusty Russell wrote: > The point of the patch is that it's unusable: > > #define VIRTIO_PCI_CONFIG(dev) ((dev)->msix_enabled ? 24 : 20) > > i.e. it's accessing a member of the kernel's virtio_pci_dev structure. Ah, okay. In that case, zap it and see if anyone complains. David
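For reference, a sketch of why the macro cannot work from userspace; the struct below is abridged from the kernel's virtio PCI driver, and only the msix_enabled member (the one the quoted macro dereferences) is shown:

	/* kernel-private device state, not visible to uapi consumers */
	struct virtio_pci_device {
		/* ... */
		unsigned msix_enabled;
	};

	/* exported in uapi, but "dev" can only ever be the kernel's
	 * private structure above, so userspace has nothing to pass in */
	#define VIRTIO_PCI_CONFIG(dev)	((dev)->msix_enabled ? 24 : 20)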
Re: [RFC PATCH 1/2] Hyper-V reference counter
Il 20/05/2013 10:49, Gleb Natapov ha scritto: > On Mon, May 20, 2013 at 10:42:52AM +0200, Paolo Bonzini wrote: >> Il 20/05/2013 10:36, Gleb Natapov ha scritto: >>> On Mon, May 20, 2013 at 10:05:38AM +0200, Paolo Bonzini wrote: Il 19/05/2013 08:37, Vadim Rozenfeld ha scritto: > On Thu, 2013-05-16 at 16:45 +0200, Paolo Bonzini wrote: >> Il 16/05/2013 16:26, Vadim Rozenfeld ha scritto: >>> >>> Yes, I have this check added in the second patch. >>> > Move it here please. >>> OK, will do it. > > Or better, remove all the handling of HV_X64_MSR_REFERENCE_TSC from > this > patch, and leave it all to the second. > >>> What for? Could you please elaborate? > > To make code reviewable. Add one MSR here, the other in the second patch. > removing HV_X64_MSR_REFERENCE_TSC will make this particular patch > completely non-functional. Do you mean Windows guest will BSOD or just that they won't use the reference TSC? If the latter, it's not a problem. >>> I think it is. If the reference counter works without the TSC we have a bisect >>> point for the case when something goes wrong with the TSC. >> >> Isn't that exactly what might happen with this patch only? Windows will >> not use the TSC because it finds invalid values in the TSC page. > > Yes, it will use the reference counter instead. Exactly what we want for a bisect > point. > >> If it >> still uses the reference counter, we have the situation you describe. >>
>> refcount  TSC page
>> Y         Y  <= after patch 2
>> Y         N  <= after patch 1
>> N         Y  <= impossible
>> N         N  <= removing TSC page from this patch?
>> Of course if the guest BSODs, it's not possible to split the patches >> that way. Perhaps in that case it's simply better to do a single patch. >> > I am not sure what you are trying to say. Your option list above shows > that there is value in splitting the patches like they are split now. Hmm, we're talking past each other. :) I put the "?" because that's what Vadim implied ("it would make this particular patch non-functional"), but I don't see why it should be like this. To me, the obvious way of getting the desired bisect point is implementing one MSR per patch. So, moving the REFERENCE_TSC handling entirely to patch 2 would still be in the "refcount=Y, TSC page=N" case. In any case, this patch needs more comments and a better commit message. Microsoft docs are decent, but there are several non-obvious points in how the patches were done, and they need to be documented.
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, May 20, 2013 at 10:42:52AM +0200, Paolo Bonzini wrote: > Il 20/05/2013 10:36, Gleb Natapov ha scritto: > > On Mon, May 20, 2013 at 10:05:38AM +0200, Paolo Bonzini wrote: > >> Il 19/05/2013 08:37, Vadim Rozenfeld ha scritto: > >>> On Thu, 2013-05-16 at 16:45 +0200, Paolo Bonzini wrote: > Il 16/05/2013 16:26, Vadim Rozenfeld ha scritto: > > > > Yes, I have this check added in the second patch. > > > >>> Move it here please. > > OK, will do it. > >>> > >>> Or better, remove all the handling of HV_X64_MSR_REFERENCE_TSC from > >>> this > >>> patch, and leave it all to the second. > >>> > > What for? Could you please elaborate? > >>> > >>> To make code reviewable. Add one MSR here, the other in the second patch. > >>> removing HV_X64_MSR_REFERENCE_TSC will make this particular patch > >>> completely non-functional. > >> > >> Do you mean Windows guest will BSOD or just that they won't use the > >> reference TSC? If the latter, it's not a problem. > >> > > I think it is. If the reference counter works without the TSC we have a bisect > > point for the case when something goes wrong with the TSC. > > Isn't that exactly what might happen with this patch only? Windows will > not use the TSC because it finds invalid values in the TSC page. Yes, it will use the reference counter instead. Exactly what we want for a bisect point. > If it > still uses the reference counter, we have the situation you describe. >
> refcount  TSC page
> Y         Y  <= after patch 2
> Y         N  <= after patch 1
> N         Y  <= impossible
> N         N  <= removing TSC page from this patch?
> Of course if the guest BSODs, it's not possible to split the patches > that way. Perhaps in that case it's simply better to do a single patch. > I am not sure what you are trying to say. Your option list above shows that there is value in splitting the patches like they are split now. -- Gleb.
Re: [RFC PATCH 1/2] Hyper-V reference counter
Il 20/05/2013 10:36, Gleb Natapov ha scritto: > On Mon, May 20, 2013 at 10:05:38AM +0200, Paolo Bonzini wrote: >> Il 19/05/2013 08:37, Vadim Rozenfeld ha scritto: >>> On Thu, 2013-05-16 at 16:45 +0200, Paolo Bonzini wrote: Il 16/05/2013 16:26, Vadim Rozenfeld ha scritto: > > Yes, I have this check added in the second patch. > >>> Move it here please. > OK, will do it. >>> >>> Or better, remove all the handling of HV_X64_MSR_REFERENCE_TSC from this >>> patch, and leave it all to the second. >>> > What for? Could you please elaborate? >>> >>> To make code reviewable. Add one MSR here, the other in the second patch. >>> removing HV_X64_MSR_REFERENCE_TSC will make this particular patch >>> completely non-functional. >> >> Do you mean Windows guest will BSOD or just that they won't use the >> reference TSC? If the latter, it's not a problem. >> > I think it is. If the reference counter works without the TSC we have a bisect > point for the case when something goes wrong with the TSC. Isn't that exactly what might happen with this patch only? Windows will not use the TSC because it finds invalid values in the TSC page. If it still uses the reference counter, we have the situation you describe.
refcount  TSC page
Y         Y  <= after patch 2
Y         N  <= after patch 1
N         Y  <= impossible
N         N  <= removing TSC page from this patch?
Of course if the guest BSODs, it's not possible to split the patches that way. Perhaps in that case it's simply better to do a single patch. Paolo
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, May 20, 2013 at 10:05:38AM +0200, Paolo Bonzini wrote: > Il 19/05/2013 08:37, Vadim Rozenfeld ha scritto: > > On Thu, 2013-05-16 at 16:45 +0200, Paolo Bonzini wrote: > >> Il 16/05/2013 16:26, Vadim Rozenfeld ha scritto: > >>> > >>> Yes, I have this check added in the second patch. > >>> > > Move it here please. > >>> OK, will do it. > > > > Or better, remove all the handling of HV_X64_MSR_REFERENCE_TSC from this > > patch, and leave it all to the second. > > > >>> What for? Could you please elaborate? > > > > To make code reviewable. Add one MSR here, the other in the second patch. > > removing HV_X64_MSR_REFERENCE_TSC will make this particular patch > > completely non-functional. > > Do you mean Windows guest will BSOD or just that they won't use the > reference TSC? If the latter, it's not a problem. > I think it is. If the reference counter works without the TSC we have a bisect point for the case when something goes wrong with the TSC. -- Gleb.
Re: [RFC PATCH 1/2] Hyper-V reference counter
Il 19/05/2013 08:37, Vadim Rozenfeld ha scritto: > On Thu, 2013-05-16 at 16:45 +0200, Paolo Bonzini wrote: >> Il 16/05/2013 16:26, Vadim Rozenfeld ha scritto: >>> >>> Yes, I have this check added in the second patch. >>> > Move it here please. >>> OK, will do it. > > Or better, remove all the handling of HV_X64_MSR_REFERENCE_TSC from this > patch, and leave it all to the second. > >>> What for? Could you please elaborate? > > To make code reviewable. Add one MSR here, the other in the second patch. > removing HV_X64_MSR_REFERENCE_TSC will make this particular patch > completely non-functional. Do you mean Windows guest will BSOD or just that they won't use the reference TSC? If the latter, it's not a problem. Paolo
Re: provide vhost thread per virtqueue for forwarding scenario
On Mon, May 20, 2013 at 02:11:19AM +0000, Qinchuanyu wrote: > The vhost thread provides both tx and rx ability for virtio-net. > In the forwarding scenarios, tx and rx share the vhost thread, and throughput > is limited by that single thread. > > So I did a patch to provide a vhost thread per virtqueue, not per vhost_net. > > Of course, multi-queue virtio-net is the final solution, but it requires a new > version of virtio-net working in the guest. > If you have to work with suse10,11, redhat 5.x as the guest, and want to improve > the forwarding throughput, > using a vhost thread per queue seems to be the only solution. Why is it? If multi-queue works well for you, just update the drivers in the guests that you care about. Guest driver backport is not so hard. In my testing, performance of a thread per vq varies: some workloads might gain throughput but you get more IPIs and more scheduling overhead, so you waste more host CPU per byte. As you create more VMs, this stops being a win. > > I did the test with kernel 3.0.27 and qemu-1.4.0, the guest is suse11-sp2, and > two vhost threads provide > double the tx/rx forwarding performance of a single vhost thread. > vhost_blk has a single virtqueue, so it still uses one vhost thread without > change. > > Is there something wrong with this solution? If not, I will post the patch later. > > Best regards > King Yes, I don't think we want to create threads even more aggressively in all cases. I'm worried about scalability as it is. I think we should explore a flexible approach: use a thread pool (for example, a wq) to share threads between virtqueues, and switch to a separate thread only if there's free CPU and existing threads are busy. Hopefully share threads between vhost instances too. -- MST
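A rough sketch of the per-virtqueue variant being described; vhost_worker and kthread_create follow the vhost core of that era, but the vq->worker field, this helper, and passing the vq (rather than the device) to the worker are assumptions, since no patch has been posted yet:

	static int vhost_vq_start_worker(struct vhost_dev *dev,
					 struct vhost_virtqueue *vq, int idx)
	{
		struct task_struct *worker;

		/* One kthread per virtqueue, so tx and rx no longer
		 * serialize behind the single per-device worker. */
		worker = kthread_create(vhost_worker, vq, "vhost-%d-vq%d",
					current->pid, idx);
		if (IS_ERR(worker))
			return PTR_ERR(worker);
		vq->worker = worker;	/* assumed new per-vq field */
		wake_up_process(worker);
		return 0;
	}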