Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap

2011-11-16 Thread Takuya Yoshikawa

Adding qemu-devel to Cc.

(2011/11/14 21:39), Avi Kivity wrote:

On 11/14/2011 12:56 PM, Takuya Yoshikawa wrote:

(2011/11/14 19:25), Avi Kivity wrote:

On 11/14/2011 11:20 AM, Takuya Yoshikawa wrote:

This is a revised version of my previous work.  I hope that
the patches are more self-explanatory than before.



It looks good.  I'll let Marcelo (or anyone else?) review it as well
before applying.

Do you have performance measurements?



For VGA, 30-40us became 3-5us when the display was quiet, with a
sufficiently warmed-up guest.



That's a nice improvement.


Near the criterion, the numbers did not differ much from the
original version.

For live migration, I forgot the exact numbers, but the result was good.
My test cases did not cover every pattern, though, so I made
the criterion a bit conservative.

 More tests may find a better criterion.
 I am not in a hurry about this, so it is fine to add some tests
 before merging.


I think we can merge it as is; it's clear we get an improvement.



I did a simple test to show numbers!

Here, a 4GB guest was migrated locally while a file was being copied inside it.


Case 1 corresponds to the original method and case 2 to the optimized one.

The small numbers are probably from VGA:

Case 1. about 30us
Case 2. about 3us

Other numbers are from the system RAM (triggered by live migration):

Case 1. about 500us, 2000us
Case 2. about  80us, 2000us (not exactly averaged, see below for details)
 * 2000us was when rmap was not used, so equal to that of case 1.

So I can say that my patch worked well for both VGA and live migration.
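For reference, the traces below look like output from ftrace's function_graph tracer; a capture setup along these lines would produce them. This is a config-style fragment, not runnable as-is: the paths assume a debugfs mount at /sys/kernel/debug, it requires root, and write_protect_slot only exists with these patches applied.

```shell
# Assumed ftrace setup -- adjust the path to your debugfs mount.
cd /sys/kernel/debug/tracing
echo function_graph > current_tracer
echo write_protect_slot > set_graph_function
echo kvm_mmu_slot_remove_write_access >> set_graph_function
echo 1 > tracing_on
# ... run the guest workload (VGA updates, live migration) ...
cat trace
```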

Takuya


=== measurement snippet ===

Case 1. kvm_mmu_slot_remove_write_access() only (same as the original method):

 qemu-system-x86-25413 [000]  6546.215009: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [000]  6546.215010: funcgraph_entry: ! 2039.512 us  |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [000]  6546.217051: funcgraph_exit:  ! 2040.487 us  |  }
 qemu-system-x86-25413 [002]  6546.217347: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [002]  6546.217349: funcgraph_entry: ! 571.121 us   |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [002]  6546.217921: funcgraph_exit:  ! 572.525 us   |  }
 qemu-system-x86-25413 [000]  6546.314583: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [000]  6546.314585: funcgraph_entry: + 29.598 us    |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [000]  6546.314616: funcgraph_exit:  + 31.053 us    |  }
 qemu-system-x86-25413 [000]  6546.314784: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [000]  6546.314785: funcgraph_entry: ! 2002.591 us  |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [000]  6546.316788: funcgraph_exit:  ! 2003.537 us  |  }
 qemu-system-x86-25413 [000]  6546.317082: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [000]  6546.317083: funcgraph_entry: ! 624.445 us   |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [000]  6546.317709: funcgraph_exit:  ! 625.861 us   |  }
 qemu-system-x86-25413 [000]  6546.414261: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [000]  6546.414263: funcgraph_entry: + 29.593 us    |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [000]  6546.414293: funcgraph_exit:  + 30.944 us    |  }
 qemu-system-x86-25413 [000]  6546.414528: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [000]  6546.414529: funcgraph_entry: ! 1990.363 us  |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [000]  6546.416520: funcgraph_exit:  ! 1991.370 us  |  }
 qemu-system-x86-25413 [000]  6546.416775: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [000]  6546.416776: funcgraph_entry: ! 594.333 us   |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [000]  6546.417371: funcgraph_exit:  ! 595.415 us   |  }
 qemu-system-x86-25413 [000]  6546.514133: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [000]  6546.514135: funcgraph_entry: + 24.032 us    |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [000]  6546.514160: funcgraph_exit:  + 25.074 us    |  }
 qemu-system-x86-25413 [000]  6546.514312: funcgraph_entry:                |  write_protect_slot() {
 qemu-system-x86-25413 [000]  6546.514313: funcgraph_entry: ! 2035.365 us  |    kvm_mmu_slot_remove_write_access();
 qemu-system-x86-25413 [000]  6546.516349: funcgraph_exit:  ! 2036.298 us  |  }
 qemu-system-x86-25413 [000]  6546.516642: funcgraph_entry:                |  write_protect_slot() {

Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-16 Thread Sasha Levin
On Wed, 2011-11-16 at 09:21 +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 16, 2011 at 10:28:52AM +1030, Rusty Russell wrote:
  On Fri, 11 Nov 2011 09:39:13 +0200, Sasha Levin levinsasha...@gmail.com 
  wrote:
   On Fri, Nov 11, 2011 at 6:24 AM, Rusty Russell ru...@rustcorp.com.au 
   wrote:
(2) There's no huge win in keeping the same layout.  Let's make some
   cleanups.  There are more users ahead of us than behind us (I
   hope!).
   
   Actually, if we already do cleanups, here are two more suggestions:
   
   1. Make the 64-bit features one big 64-bit block, instead of having 32 bits
   in one place and 32 in another.
   2. Remove the reserved fields from the config (the ones that were
   left over from moving the ISR and the notifications out).
  
  Yes, those were exactly what I was thinking.  I left it vague because
  there might be others you can see if we're prepared to abandon the
  current format.
  
  Cheers,
  Rusty.
 
 Yes but driver code doesn't get any cleaner by moving the fields.
 And in fact, the legacy support makes the code messier.
 What are the advantages?
 

What about splitting the parts which handle legacy code and new code?
It would make it easier to play with the new spec more freely, and would
also make it easier to remove legacy code in the future, since you'd only
need to delete a chunk of code instead of excising legacy bits from
working code with a surgical knife.

-- 

Sasha.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap

2011-11-16 Thread Avi Kivity
On 11/16/2011 06:28 AM, Takuya Yoshikawa wrote:
 (2011/11/14 21:39), Avi Kivity wrote:
 There was a patchset from Peter Zijlstra that converted mmu notifiers to
 be preemptible; with that, we can convert the mmu spinlock to a mutex.
 I'll see what happened to it.

 Interesting!

 There is a third method of doing write protection, and that is by
 write-protecting at the higher levels of the paging hierarchy.  The
 advantage there is that write protection is O(1) no matter how large the
 guest is, or the number of dirty pages.

 To write protect all guest memory, we just write protect the 512 PTEs at
 the very top, and leave the rest alone.  When the guest writes to a
 page, we allow writes for the top-level PTE that faulted, and
 write-protect all the PTEs that it points to.

 One important point is that the guest, not the GET_DIRTY_LOG caller, pays
 for the write protection at fault time.

I don't think there is a significant difference.  The number of write
faults does not change.  The amount of work done per fault does, but not
by much, thanks to the writeable bitmap.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-16 Thread Michael S. Tsirkin
On Wed, Nov 16, 2011 at 10:17:39AM +0200, Sasha Levin wrote:
 On Wed, 2011-11-16 at 09:21 +0200, Michael S. Tsirkin wrote:
  On Wed, Nov 16, 2011 at 10:28:52AM +1030, Rusty Russell wrote:
   On Fri, 11 Nov 2011 09:39:13 +0200, Sasha Levin levinsasha...@gmail.com 
   wrote:
On Fri, Nov 11, 2011 at 6:24 AM, Rusty Russell ru...@rustcorp.com.au 
wrote:
 (2) There's no huge win in keeping the same layout.  Let's make some
cleanups.  There are more users ahead of us than behind us (I
hope!).

Actually, if we already do cleanups, here are two more suggestions:

1. Make the 64-bit features one big 64-bit block, instead of having 32 bits
in one place and 32 in another.
2. Remove the reserved fields from the config (the ones that were
left over from moving the ISR and the notifications out).
   
   Yes, those were exactly what I was thinking.  I left it vague because
   there might be others you can see if we're prepared to abandon the
   current format.
   
   Cheers,
   Rusty.
  
  Yes but driver code doesn't get any cleaner by moving the fields.
  And in fact, the legacy support makes the code messier.
  What are the advantages?
  

The advantages question is what should really balance out the overhead.

 What about splitting the parts which handle legacy code and new code?

Well, I considered that. Something along the lines of

#define VIRTIO_NEW_MSI_CONFIG_VECTOR	18

and so on for all registers.

This seems to add a significant maintenance burden because of code
duplication. Note that, for example, vector programming is affected.
Multiply that by the number of guest OSes.


 It'll make it easier playing with the new spec more freely

I'm really worried about maintaining drivers long term.
Ease of experimentation is secondary for me.

 and will also
 make it easier removing legacy code in the future since you'll need to
 simply delete a chunk of code instead of removing legacy bits out of
 working code with a surgical knife.

It's unlikely to be a single chunk: we'd have structures and macros
which are separate. So at least 3 chunks.

Just for fun, here's what's involved in removing legacy map
support on top of my patch. As you see there are 4 chunks:
structure decl, map, unmap, and msix enable/disable.
And finding them was as simple as looking for legacy_map.


---

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index d242fcc..6c4d2faf 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -64,9 +64,6 @@ struct virtio_pci_device
 
 	/* Various IO mappings: used for resource tracking only. */
 
-	/* Legacy BAR0: typically PIO. */
-	void __iomem *legacy_map;
-
 	/* Mappings specified by device capabilities: typically in MMIO */
 	void __iomem *isr_map;
 	void __iomem *notify_map;
@@ -81,11 +78,7 @@ struct virtio_pci_device
 static void virtio_pci_set_msix_enabled(struct virtio_pci_device *vp_dev, int enabled)
 {
 	vp_dev->msix_enabled = enabled;
-	if (vp_dev->device_map)
-		vp_dev->ioaddr_device = vp_dev->device_map;
-	else
-		vp_dev->ioaddr_device = vp_dev->legacy_map +
-			VIRTIO_PCI_CONFIG(vp_dev);
+	vp_dev->ioaddr_device = vp_dev->device_map;
 }
 
 static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 cap_id,
@@ -147,8 +140,6 @@ err:
 
 static void virtio_pci_iounmap(struct virtio_pci_device *vp_dev)
 {
-	if (vp_dev->legacy_map)
-		pci_iounmap(vp_dev->pci_dev, vp_dev->legacy_map);
 	if (vp_dev->isr_map)
 		pci_iounmap(vp_dev->pci_dev, vp_dev->isr_map);
 	if (vp_dev->notify_map)
@@ -176,36 +167,15 @@ static int virtio_pci_iomap(struct virtio_pci_device *vp_dev)
 
 	if (!vp_dev->notify_map || !vp_dev->common_map ||
 	    !vp_dev->device_map) {
-		/*
-		 * If not all capabilities present, map legacy PIO.
-		 * Legacy access is at BAR 0. We never need to map
-		 * more than 256 bytes there, since legacy config space
-		 * used PIO which has this size limit.
-		 * */
-		vp_dev->legacy_map = pci_iomap(vp_dev->pci_dev, 0, 256);
-		if (!vp_dev->legacy_map) {
-			dev_err(&vp_dev->vdev.dev, "Unable to map legacy PIO");
-			goto err;
-		}
+		dev_err(&vp_dev->vdev.dev, "Unable to map IO");
+		goto err;
 	}
 
-	/* Prefer MMIO if available. If not, fallback to legacy PIO. */
-	if (vp_dev->common_map)
-		vp_dev->ioaddr = vp_dev->common_map;
-	else
-		vp_dev->ioaddr = vp_dev->legacy_map;
+	vp_dev->ioaddr = vp_dev->common_map;
 
-	if (vp_dev->device_map)
-		vp_dev->ioaddr_device = vp_dev->device_map;
-	else
-		vp_dev->ioaddr_device = vp_dev->legacy_map +
-			VIRTIO_PCI_CONFIG(vp_dev);
+	vp_dev->ioaddr_device = vp_dev->device_map;
+

Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-16 Thread Krishna Kumar2
jason wang jasow...@redhat.com wrote on 11/16/2011 11:40:45 AM:

Hi Jason,

 Have any thought in mind to solve the issue of flow handling?

So far nothing concrete.

 Maybe getting some performance numbers first would be better; it would
 let us know where we are. During testing of my patchset, I found a big
 regression in small-packet transmission, and more retransmissions were
 noticed. This may also be an issue of flow affinity. One interesting
 thing is to see whether this happens in your patches :)

I haven't got any results for small packets, but will run this week
and send an update. I remember my earlier patches having regressions
for small packets.

 I've played with a basic flow director implementation based on my series
 which tries to make sure the packets of a flow are handled by the same
 vhost thread/guest vcpu. This is done by:

 - bind the virtqueue to a guest cpu
 - record the hash-to-queue mapping when the guest sends packets, and use
 this mapping to choose the virtqueue when forwarding packets to the guest

 Tests show some improvement for receiving packets from an external host
 and for sending packets to the local host. But it hurts the performance
 of sending packets to a remote host. This is not a perfect solution, as
 it cannot handle the guest moving processes among vcpus; I plan to try
 accelerated RFS and sharing the mapping between host and guest.

 Anyway, this is just for receiving; small-packet sending needs more
 thought.

I don't recollect small-packet performance for guest-to-local host.
Also, using multiple tun devices on the bridge (instead of mq-tun)
balances the rx/tx of a flow to a single vq. Then you can avoid
mq-tun with its queue selector function, etc. Have you tried it?

I will run my tests this week and get back.

thanks,

- KK



Re: [PATCH 0/2] Introduce iommu_commit() function

2011-11-16 Thread Avi Kivity
On 06/23/2011 06:38 PM, David Woodhouse wrote:
 On Thu, 2011-06-23 at 17:31 +0200, Joerg Roedel wrote:
  David, I think especially VT-d can benefit from such a callback. I will
  implement support for it in the AMD IOMMU driver and post a patch-set
  soon.
  
  Any comments, thoughts?

 Ick. We *already* do the flushes as appropriate while we're filling the
 page tables. So every time we move on from one page table page to the
 next, we'll flush the old one. And when we've *done* filling the page
 tables for the range we've been asked to map, we flush the last writes
 too.

For the current kvm use case flushing just once on commit is most
efficient.  If/when we get resumable io faults, per-page flushing
becomes worthwhile.

 The problem with KVM is that it calls us over and over again to map a
 single 4KiB page.

 It doesn't seem simple to make use of a 'commit' function, because we'd
 have to keep track of *which* page tables are dirty.

You could easily do that by using a free bit in the pte as a dirty bit. 
You can then choose whether to use per-page flush or a full flush.

 I'd much rather KVM just gave us a list of the pages to map, in a single
 call. 

The list can easily be several million pages long.

 Or even a 'translation' callback we could call to get the physical
 address for each page in the range.

This is doable, and is probably most flexible.  If the translation also
returns ranges, then you don't have to figure out large mappings yourself.

Not that there's a huge difference between

   iommu_begin(iommu_transaction, domain)
   for (page in range)
       iommu_map(iommu_transaction, page, translate(page))
   iommu_commit(iommu_transaction)

and

   iommu_map(domain, range, translate)

- one can be converted to the other.

-- 
error compiling committee.c: too many arguments to function



Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-16 Thread jason wang
On 11/16/2011 05:09 PM, Krishna Kumar2 wrote:
 jason wang jasow...@redhat.com wrote on 11/16/2011 11:40:45 AM:

 Hi Jason,

 Have any thought in mind to solve the issue of flow handling?
 So far nothing concrete.

 Maybe getting some performance numbers first would be better; it would
 let us know where we are. During testing of my patchset, I found a big
 regression in small-packet transmission, and more retransmissions were
 noticed. This may also be an issue of flow affinity. One interesting
 thing is to see whether this happens in your patches :)
 I haven't got any results for small packets, but will run this week
 and send an update. I remember my earlier patches having regressions
 for small packets.

 I've played with a basic flow director implementation based on my series
 which tries to make sure the packets of a flow are handled by the same
 vhost thread/guest vcpu. This is done by:

 - bind the virtqueue to a guest cpu
 - record the hash-to-queue mapping when the guest sends packets, and use
 this mapping to choose the virtqueue when forwarding packets to the guest

 Tests show some improvement for receiving packets from an external host
 and for sending packets to the local host. But it hurts the performance
 of sending packets to a remote host. This is not a perfect solution, as
 it cannot handle the guest moving processes among vcpus; I plan to try
 accelerated RFS and sharing the mapping between host and guest.

 Anyway, this is just for receiving; small-packet sending needs more
 thought.
 I don't recollect small-packet performance for guest-to-local host.
 Also, using multiple tun devices on the bridge (instead of mq-tun)
 balances the rx/tx of a flow to a single vq. Then you can avoid
 mq-tun with its queue selector function, etc. Have you tried it?

I remember it working when I tested your patchset earlier this year, but I
didn't measure its performance. If multiple tun devices were used, the MAC
address table would be updated very frequently and packets could not be
forwarded in parallel (unless we make the bridge support multiqueue).


 I will run my tests this week and get back.

 thanks,

 - KK




[PATCH 1/2] kvm tools: Add optional callbacks for VQs

2011-11-16 Thread Sasha Levin
This patch adds optional callbacks which get called when the VQ gets assigned
an eventfd for notifications, and when it gets assigned a GSI.

This allows the device to pass the eventfds to third parties, which can use
them to send and receive notifications regarding the VQ.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/include/kvm/virtio-trans.h |2 ++
 tools/kvm/virtio/pci.c   |6 ++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/tools/kvm/include/kvm/virtio-trans.h b/tools/kvm/include/kvm/virtio-trans.h
index d9f4b95..e7c186e 100644
--- a/tools/kvm/include/kvm/virtio-trans.h
+++ b/tools/kvm/include/kvm/virtio-trans.h
@@ -20,6 +20,8 @@ struct virtio_ops {
 	int (*notify_vq)(struct kvm *kvm, void *dev, u32 vq);
 	int (*get_pfn_vq)(struct kvm *kvm, void *dev, u32 vq);
 	int (*get_size_vq)(struct kvm *kvm, void *dev, u32 vq);
+	void (*notify_vq_gsi)(struct kvm *kvm, void *dev, u32 vq, u32 gsi);
+	void (*notify_vq_eventfd)(struct kvm *kvm, void *dev, u32 vq, u32 efd);
 };
 
 struct virtio_trans_ops {
diff --git a/tools/kvm/virtio/pci.c b/tools/kvm/virtio/pci.c
index 1660f06..0737ae7 100644
--- a/tools/kvm/virtio/pci.c
+++ b/tools/kvm/virtio/pci.c
@@ -51,6 +51,9 @@ static int virtio_pci__init_ioeventfd(struct kvm *kvm, struct virtio_trans *vtrans,
 
 	ioeventfd__add_event(&ioevent);
 
+	if (vtrans->virtio_ops->notify_vq_eventfd)
+		vtrans->virtio_ops->notify_vq_eventfd(kvm, vpci->dev, vq, ioevent.fd);
+
 	return 0;
 }
 
@@ -152,6 +155,9 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_trans *vtrans,
 
 			gsi = irq__add_msix_route(kvm, &vpci->msix_table[vec].msg);
 			vpci->gsis[vpci->queue_selector] = gsi;
+			if (vtrans->virtio_ops->notify_vq_gsi)
+				vtrans->virtio_ops->notify_vq_gsi(kvm, vpci->dev,
+						vpci->queue_selector, gsi);
 			break;
 		}
 	};
-- 
1.7.8.rc1



[PATCH 2/2] kvm tools: Add vhost-net support

2011-11-16 Thread Sasha Levin
This patch adds support for using the vhost-net device with a tap-backed
virtio-net device.

Activating vhost-net is done by appending a 'vhost=1' flag to the net device
configuration. For example:

'kvm run -n mode=tap,vhost=1'

Cc: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/builtin-run.c|2 +
 tools/kvm/include/kvm/virtio-net.h |1 +
 tools/kvm/virtio/net.c |  120 +++-
 3 files changed, 122 insertions(+), 1 deletions(-)

diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c
index 13025db..3b00bf0 100644
--- a/tools/kvm/builtin-run.c
+++ b/tools/kvm/builtin-run.c
@@ -217,6 +217,8 @@ static int set_net_param(struct virtio_net_params *p, const char *param,
 		p->guest_ip = strdup(val);
 	} else if (strcmp(param, "host_ip") == 0) {
 		p->host_ip = strdup(val);
+	} else if (strcmp(param, "vhost") == 0) {
+		p->vhost = atoi(val);
 	}
 
 	return 0;
diff --git a/tools/kvm/include/kvm/virtio-net.h b/tools/kvm/include/kvm/virtio-net.h
index 58ae162..dade8cb 100644
--- a/tools/kvm/include/kvm/virtio-net.h
+++ b/tools/kvm/include/kvm/virtio-net.h
@@ -11,6 +11,7 @@ struct virtio_net_params {
 	char host_mac[6];
 	struct kvm *kvm;
 	int mode;
+	int vhost;
 };
 
 void virtio_net__init(const struct virtio_net_params *params);
diff --git a/tools/kvm/virtio/net.c b/tools/kvm/virtio/net.c
index cee2b5b..58ca4ed 100644
--- a/tools/kvm/virtio/net.c
+++ b/tools/kvm/virtio/net.c
@@ -10,6 +10,7 @@
 #include "kvm/guest_compat.h"
 #include "kvm/virtio-trans.h"
 
+#include <linux/vhost.h>
 #include <linux/virtio_net.h>
 #include <linux/if_tun.h>
 #include <linux/types.h>
@@ -25,6 +26,7 @@
 #include <sys/ioctl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
+#include <sys/eventfd.h>
 
 #define VIRTIO_NET_QUEUE_SIZE	128
 #define VIRTIO_NET_NUM_QUEUES	2
@@ -57,6 +59,7 @@ struct net_dev {
 	pthread_mutex_t	io_tx_lock;
 	pthread_cond_t	io_tx_cond;
 
+	int		vhost_fd;
 	int		tap_fd;
 	char		tap_name[IFNAMSIZ];
 
@@ -323,9 +326,12 @@ static void set_guest_features(struct kvm *kvm, void *dev, u32 features)
 
 static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
 {
+	struct vhost_vring_state state = { .index = vq };
+	struct vhost_vring_addr addr;
 	struct net_dev *ndev = dev;
 	struct virt_queue *queue;
 	void *p;
+	int r;
 
 	compat__remove_message(compat_id);
 
@@ -335,9 +341,82 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
 
 	vring_init(&queue->vring, VIRTIO_NET_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
 
+	if (ndev->vhost_fd == 0)
+		return 0;
+
+	state.num = queue->vring.num;
+	r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_NUM, &state);
+	if (r < 0)
+		die_perror("VHOST_SET_VRING_NUM failed");
+	state.num = 0;
+	r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_BASE, &state);
+	if (r < 0)
+		die_perror("VHOST_SET_VRING_BASE failed");
+
+	addr = (struct vhost_vring_addr) {
+		.index = vq,
+		.desc_user_addr = (u64)(unsigned long)queue->vring.desc,
+		.avail_user_addr = (u64)(unsigned long)queue->vring.avail,
+		.used_user_addr = (u64)(unsigned long)queue->vring.used,
+	};
+
+	r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_ADDR, &addr);
+	if (r < 0)
+		die_perror("VHOST_SET_VRING_ADDR failed");
+
 	return 0;
 }
 
+static void notify_vq_gsi(struct kvm *kvm, void *dev, u32 vq, u32 gsi)
+{
+	struct net_dev *ndev = dev;
+	struct kvm_irqfd irq;
+	struct vhost_vring_file file;
+	int r;
+
+	if (ndev->vhost_fd == 0)
+		return;
+
+	irq = (struct kvm_irqfd) {
+		.gsi	= gsi,
+		.fd	= eventfd(0, 0),
+	};
+	file = (struct vhost_vring_file) {
+		.index	= vq,
+		.fd	= irq.fd,
+	};
+
+	r = ioctl(kvm->vm_fd, KVM_IRQFD, &irq);
+	if (r < 0)
+		die_perror("KVM_IRQFD failed");
+
+	r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_CALL, &file);
+	if (r < 0)
+		die_perror("VHOST_SET_VRING_CALL failed");
+	file.fd = ndev->tap_fd;
+	r = ioctl(ndev->vhost_fd, VHOST_NET_SET_BACKEND, &file);
+	if (r != 0)
+		die("VHOST_NET_SET_BACKEND failed %d", errno);
+
+}
+
+static void notify_vq_eventfd(struct kvm *kvm, void *dev, u32 vq, u32 efd)
+{
+	struct net_dev *ndev = dev;
+	struct vhost_vring_file file = {
+		.index	= vq,
+		.fd	= efd,
+	};
+	int r;
+
+	if (ndev->vhost_fd == 0)
+		return;
+
+	r = ioctl(ndev->vhost_fd,

Re: [PATCH 2/2] kvm tools: Add vhost-net support

2011-11-16 Thread Sasha Levin
On Wed, 2011-11-16 at 14:24 +0200, Sasha Levin wrote:
 This patch adds support for using the vhost-net device with a tap-backed
 virtio-net device.
 
 Activating vhost-net is done by appending a 'vhost=1' flag to the net device
 configuration. For example:
 
   'kvm run -n mode=tap,vhost=1'
 
 Cc: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Sasha Levin levinsasha...@gmail.com
 ---

I forgot to attach performance numbers to the changelog, so here they
are:

Short version
--

TCP Throughput: +29%
UDP Throughput: +10%
TCP Latency: -15%
UDP Latency: -12%


Long version
--

Before:

MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.33.4 (192.168.33.4) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    4895.04

MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.33.4 (192.168.33.4) port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   65507   10.00      125287      0    6565.60
229376           10.00      106910           5602.57

MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    14811.55

MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

229376 229376 1        1       10.00    16000.44
229376 229376

After:

MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.33.4 (192.168.33.4) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    6340.74

MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.33.4 (192.168.33.4) port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   65507   10.00      131478      0    6890.09
229376           10.00      118136           6190.90

MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    17126.10

MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

229376 229376 1        1       10.00    17944.51

-- 

Sasha.



Re: [PATCH 0/2] Introduce iommu_commit() function

2011-11-16 Thread Joerg Roedel
On Wed, Nov 16, 2011 at 11:00:56AM +0900, KyongHo Cho wrote:
 On Wed, Jun 29, 2011 at 2:51 PM, Joerg Roedel j...@8bytes.org wrote:

 In the 'next' branch of
 http://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git,
 I found that iommu_commit() is removed.
 
 Why is it removed?

It was never in the next branch. It actually is in the master branch,
but that happened accidentally :)
The reason is that there is not enough consensus about this interface
yet. This is also the reason I haven't pushed it upstream yet.


Joerg


Re: [RFC] kvm tools: Add support for virtio-mmio

2011-11-16 Thread Pawel Moll
On Tue, 2011-11-15 at 17:56 +, Sasha Levin wrote:
 Hmm... If thats the plan, it should probably be a virtio thing (not
 virtio-mmio specific).
 
 Either way, it could also use some clarification in the spec.

Well, the spec (p. 2.1) says: "The Subsystem Vendor ID should reflect
the PCI Vendor ID of the environment (it's currently only used for
informational purposes by the guest)." The fact is that all the current
virtio drivers simply ignore this field. So unless this changes, I simply
have no idea how to describe that register. "Put anything there, no one
cares"? "Write zero now, may change in future"? Any ideas welcome.

Cheers!

Paweł

PS. Thanks for defending my honour in the delayed-explosive-device
thread ;-)




Re: [RFC] kvm tools: Add support for virtio-mmio

2011-11-16 Thread Sasha Levin
On Wed, 2011-11-16 at 13:21 +, Pawel Moll wrote:
 On Tue, 2011-11-15 at 17:56 +, Sasha Levin wrote:
  Hmm... If thats the plan, it should probably be a virtio thing (not
  virtio-mmio specific).
  
  Either way, it could also use some clarification in the spec.
 
 Well, the spec (p. 2.1) says: "The Subsystem Vendor ID should reflect
 the PCI Vendor ID of the environment (it's currently only used for
 informational purposes by the guest)." The fact is that all the current
 virtio drivers simply ignore this field. So unless this changes, I simply
 have no idea how to describe that register. "Put anything there, no one
 cares"? "Write zero now, may change in future"? Any ideas welcome.
 
 Cheers!
 
 Paweł
 
 PS. Thanks for defending my honour in the delayed-explosive-device
 thread ;-)

We can add an appendix to the virtio spec with known virtio subsystem
vendors, patch QEMU & KVM tool to pass that, and possibly modify the
QEMU-related workarounds in the kernel to only do the workaround thing
if QEMU is set as the vendor.

-- 

Sasha.



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-16 Thread Konrad Rzeszutek Wilk
On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
 
 Thanks Konrad!  Comments inline.
 
 On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
  On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
   VFIO provides a secure, IOMMU based interface for user space
   drivers, including device assignment to virtual machines.
   This provides the base management of IOMMU groups, devices,
   and IOMMU objects.  See Documentation/vfio.txt included in
   this patch for user and kernel API description.
   
   Note, this implements the new API discussed at KVM Forum
2011, as represented by the driver version 0.2.  It's hoped
   that this provides a modular enough interface to support PCI
   and non-PCI userspace drivers across various architectures
   and IOMMU implementations.
   
   Signed-off-by: Alex Williamson alex.william...@redhat.com
   ---
   
   Fingers crossed, this is the last RFC for VFIO, but we need
   the iommu group support before this can go upstream
   (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
   hoping this helps push that along.
   
   Since the last posting, this version completely modularizes
   the device backends and better defines the APIs between the
   core VFIO code and the device backends.  I expect that we
   might also adopt a modular IOMMU interface as iommu_ops learns
   about different types of hardware.  Also many, many cleanups.
   Check the complete git history for details:
   
   git://github.com/awilliam/linux-vfio.git vfio-ng
   
   (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
   
   This version, along with the supporting VFIO PCI backend can
   be found here:
   
   git://github.com/awilliam/linux-vfio.git vfio-next-2003
   
   I've held off on implementing a kernel-user signaling
   mechanism for now since the previous netlink version produced
   too many gag reflexes.  It's easy enough to set a bit in the
group flags to indicate such support in the future, so I
   think we can move ahead without it.
   
   Appreciate any feedback or suggestions.  Thanks,
   
   Alex
   
Documentation/ioctl/ioctl-number.txt |1 
Documentation/vfio.txt   |  304 +
MAINTAINERS  |8 
drivers/Kconfig  |2 
drivers/Makefile |1 
drivers/vfio/Kconfig |8 
drivers/vfio/Makefile|3 
drivers/vfio/vfio_iommu.c|  530 
drivers/vfio/vfio_main.c | 1151 ++
drivers/vfio/vfio_private.h  |   34 +
include/linux/vfio.h |  155 +
11 files changed, 2197 insertions(+), 0 deletions(-)
create mode 100644 Documentation/vfio.txt
create mode 100644 drivers/vfio/Kconfig
create mode 100644 drivers/vfio/Makefile
create mode 100644 drivers/vfio/vfio_iommu.c
create mode 100644 drivers/vfio/vfio_main.c
create mode 100644 drivers/vfio/vfio_private.h
create mode 100644 include/linux/vfio.h
   
   diff --git a/Documentation/ioctl/ioctl-number.txt 
   b/Documentation/ioctl/ioctl-number.txt
   index 54078ed..59d01e4 100644
   --- a/Documentation/ioctl/ioctl-number.txt
   +++ b/Documentation/ioctl/ioctl-number.txt
   @@ -88,6 +88,7 @@ Code  Seq#(hex) Include FileComments
 and kernel/power/user.c
'8'  all SNP8023 advanced NIC card
 mailto:m...@solidum.com
   +';'  64-76   linux/vfio.h
'@'  00-0F   linux/radeonfb.hconflict!
'@'  00-0F   drivers/video/aty/aty128fb.cconflict!
'A'  00-1F   linux/apm_bios.hconflict!
   diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
   new file mode 100644
   index 000..5866896
   --- /dev/null
   +++ b/Documentation/vfio.txt
   @@ -0,0 +1,304 @@
   +VFIO - Virtual Function I/O[1]
   +---
    +Many modern systems now provide DMA and interrupt remapping facilities
   +to help ensure I/O devices behave within the boundaries they've been
   +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
   +well as POWER systems with Partitionable Endpoints (PEs) and even
   +embedded powerpc systems (technology name unknown).  The VFIO driver
   +is an IOMMU/device agnostic framework for exposing direct device
   +access to userspace, in a secure, IOMMU protected environment.  In
   +other words, this allows safe, non-privileged, userspace drivers.
   +
   +Why do we want that?  Virtual machines often make use of direct device
   +access (device assignment) when configured for the highest possible
   +I/O performance.  From a device and host perspective, this simply turns
   +the VM into a userspace driver, with the benefits of significantly
   +reduced latency, higher 

[ANNOUNCE] qemu-kvm-1.0-rc2

2011-11-16 Thread Avi Kivity
qemu-kvm-1.0-rc2 is now available. This release is based on the upstream
qemu 1.0-rc2, plus kvm-specific enhancements.

This release can be used with the kvm kernel modules provided by your
distribution kernel, or by the modules in the kvm-kmod package, such
as kvm-kmod-3.1.


http://www.linux-kvm.org


Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-16 Thread Scott Wood
On 11/11/2011 04:10 PM, Alex Williamson wrote:
 
 Thanks Konrad!  Comments inline.
 
 On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
 On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
 +When supported, as indicated by the device flags, reset the device.
 +
 +#define VFIO_DEVICE_RESET   _IO(';', 116)

 Does it disable the 'count'? Err, does it disable the IRQ on the
 device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
 to set new eventfds? Or does it re-use the eventfds and the device
 is enabled after this?
 
 It doesn't affect the interrupt programming.  Should it?

It should probably clear any currently pending interrupts, as if the
unmask IOCTL were called.

 +device tree properties of the device:
 +
 +struct vfio_dtpath {
 +__u32   len;/* length of structure */
 +__u32   index;

 0 based I presume?
 
 Everything else is, I would assume so.

Yes, it should be zero-based -- this matches how such indices are done
in the kernel device tree APIs.

 +__u64   flags;
 +#define VFIO_DTPATH_FLAGS_REGION	(1 << 0)

 What is region in this context?? Or would this make much more sense
 if I knew what Device Tree actually is.
 
 Powerpc guys, any comments?  This was their suggestion.  These are
 effectively the first device specific extension, available when
 VFIO_DEVICE_FLAGS_DT is set.

An assigned device may consist of an entire subtree of the device tree,
and both register banks and interrupts can come from any node in the
tree.  Region versus IRQ here indicates the context in which to
interpret index, in order to retrieve the path of the node that supplied
this particular region or IRQ.

 +};
 +#define VFIO_DEVICE_GET_DTPATH  _IOWR(';', 117, struct vfio_dtpath)
 +
 +struct vfio_dtindex {
 +__u32   len;/* length of structure */
 +__u32   index;
 +__u32   prop_type;

 Is that an enum type? Is this definied somewhere?
 +__u32   prop_index;

 What is the purpose of this field?
 
 Need input from powerpc folks here

To identify what this resource (register bank or IRQ) this is, we need
both the path to the node and the index into the reg or interrupts
property within the node.

We also need to distinguish reg from ranges, and interrupts from
interrupt-map.  As you suggested elsewhere in the thread, the device
tree API should probably be left out for now, and added later along with
the device tree bus driver.

 +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
 +   struct vfio_device *device)
 +{
 +   BUG_ON(!iommu->domain && device->attached);

 Whoa. Heavy hammer there.

 Perhaps WARN_ON as you do check it later on.
 
 I think it's warranted, internal consistency is broken if we have a
 device that thinks it's attached to an iommu domain that doesn't exist.
 It should, of course, never happen and this isn't a performance path.
 
[snip]
 +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
 +  struct vfio_device *device)
 +{
 +   int ret;
 +
 +   BUG_ON(device->attached);

 How about:

 WARN_ON(device->attached, "The engineer who wrote the user-space device
 driver is trying to register the device again! Tell him/her to stop please.\n");
 
 I would almost demote this one to a WARN_ON, but userspace isn't in
 control of attaching and detaching devices from the iommu.  That's a
 side effect of getting the iommu or device file descriptor.  So again,
 this is an internal consistency check and it should never happen,
 regardless of userspace.

The rule isn't to use BUG for internal consistency checks and WARN for
stuff userspace can trigger, but rather to use BUG if you cannot
reasonably continue, WARN for significant issues that need prompt
attention that are reasonably recoverable.  Most instances of WARN are
internal consistency checks.

From include/asm-generic/bug.h:
 If you're tempted to BUG(), think again:  is completely giving up
 really the *only* solution?  There are usually better options, where
 users don't need to reboot ASAP and can mostly shut down cleanly.
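The distinction can be shown in miniature. Below is a hedged user-space sketch — WARN_ON/BUG_ON here are simplified stand-ins for the kernel macros, and attach_device is an invented example, not the VFIO code — illustrating why a recoverable consistency check should warn and refuse, rather than give up entirely:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* User-space stand-ins for the kernel macros: WARN_ON reports the
 * problem and lets the caller recover; BUG_ON gives up completely. */
#define WARN_ON(cond) ({                                        \
        int __ret_warn = !!(cond);                              \
        if (__ret_warn)                                         \
            fprintf(stderr, "WARNING at %s:%d: %s\n",           \
                    __FILE__, __LINE__, #cond);                 \
        __ret_warn;                                             \
    })

#define BUG_ON(cond) do {                                       \
        if (cond) {                                             \
            fprintf(stderr, "BUG at %s:%d: %s\n",               \
                    __FILE__, __LINE__, #cond);                 \
            abort();                                            \
        }                                                       \
    } while (0)

/* Hypothetical attach routine: a double attach is an internal
 * consistency bug, but one the caller can fail gracefully on
 * instead of killing the whole machine. */
static int attach_device(int *attached)
{
    if (WARN_ON(*attached))
        return -1;      /* warn once, refuse the attach, keep running */
    *attached = 1;
    return 0;
}
```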

-Scott



[PATCH 0/6] pci-assign: Multiple fixes and cleanups

2011-11-16 Thread Alex Williamson
These patches are all independent.  Patches 1 & 2 fix serious
usability bugs.  Patches 3-6 are more subtle things that
Markus was able to find with Coverity.

Patch 1 fixes https://bugs.launchpad.net/qemu/+bug/875723

I also tested https://bugs.launchpad.net/qemu/+bug/877155
but I'm unable to reproduce.  An 82576 VF works just fine
in a Windows 2008 guest with this patch series.  Thanks,

Alex

---

Alex Williamson (6):
  pci-assign: Harden I/O port test
  pci-assign: Remove bogus PCIe lnkcap wmask setting
  pci-assign: Fix PCIe lnkcap
  pci-assign: Fix PCI_EXP_FLAGS_TYPE shift
  pci-assign: Fix I/O port
  pci-assign: Fix device removal


 hw/device-assignment.c |  137 
 1 files changed, 57 insertions(+), 80 deletions(-)



[PATCH 1/6] pci-assign: Fix device removal

2011-11-16 Thread Alex Williamson
We're destroying the memory container before we remove the
subregions it holds.  This fixes:

https://bugs.launchpad.net/qemu/+bug/875723

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 11efd16..cde0681 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -677,10 +677,23 @@ static void free_assigned_device(AssignedDevice *dev)
             kvm_remove_ioport_region(region->u.r_baseport, region->r_size,
                                      dev->dev.qdev.hotplugged);
         }
+        memory_region_del_subregion(&region->container,
+                                    &region->real_iomem);
+        memory_region_destroy(&region->real_iomem);
+        memory_region_destroy(&region->container);
     } else if (pci_region->type & IORESOURCE_MEM) {
         if (region->u.r_virtbase) {
             memory_region_del_subregion(&region->container,
                                         &region->real_iomem);
+
+            /* Remove MSI-X table subregion */
+            if (pci_region->base_addr <= dev->msix_table_addr &&
+                pci_region->base_addr + pci_region->size >
+                dev->msix_table_addr) {
+                memory_region_del_subregion(&region->container,
+                                            &dev->mmio);
+            }
+
             memory_region_destroy(&region->real_iomem);
             memory_region_destroy(&region->container);
             if (munmap(region->u.r_virtbase,



[PATCH 2/6] pci-assign: Fix I/O port

2011-11-16 Thread Alex Williamson
The old_portio structure seems broken.  Throw it away and
switch to the new style.  This was hitting an assert when
trying to make use of I/O port regions.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |  103 
 1 files changed, 35 insertions(+), 68 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index cde0681..571a097 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -65,100 +65,76 @@ static void assigned_dev_load_option_rom(AssignedDevice *dev);
 
 static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev);
 
-static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region,
-                                       uint32_t addr, int len, uint32_t *val)
+static uint64_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region,
+                                       target_phys_addr_t addr, int size,
+                                       uint64_t *data)
 {
-    uint32_t ret = 0;
-    uint32_t offset = addr;
+    uint64_t val = 0;
     int fd = dev_region->region->resource_fd;
 
     if (fd >= 0) {
-        if (val) {
-            DEBUG("pwrite val=%x, len=%d, e_phys=%x, offset=%x\n",
-                  *val, len, addr, offset);
-            if (pwrite(fd, val, len, offset) != len) {
+        if (data) {
+            DEBUG("pwrite data=%x, size=%d, e_phys=%x, addr=%x\n",
+                  *data, size, addr, addr);
+            if (pwrite(fd, data, size, addr) != size) {
                 fprintf(stderr, "%s - pwrite failed %s\n",
                         __func__, strerror(errno));
             }
         } else {
-            if (pread(fd, &ret, len, offset) != len) {
+            if (pread(fd, &val, size, addr) != size) {
                 fprintf(stderr, "%s - pread failed %s\n",
                         __func__, strerror(errno));
-                ret = (1UL << (len * 8)) - 1;
+                val = (1UL << (size * 8)) - 1;
             }
-            DEBUG("pread ret=%x, len=%d, e_phys=%x, offset=%x\n",
-                  ret, len, addr, offset);
+            DEBUG("pread val=%x, size=%d, e_phys=%x, addr=%x\n",
+                  val, size, addr, addr);
         }
     } else {
-        uint32_t port = offset + dev_region->u.r_baseport;
+        uint32_t port = addr + dev_region->u.r_baseport;
 
-        if (val) {
-            DEBUG("out val=%x, len=%d, e_phys=%x, host=%x\n",
-                  *val, len, addr, port);
-            switch (len) {
+        if (data) {
+            DEBUG("out data=%x, size=%d, e_phys=%x, host=%x\n",
+                  *data, size, addr, port);
+            switch (size) {
             case 1:
-                outb(*val, port);
+                outb(*data, port);
                 break;
             case 2:
-                outw(*val, port);
+                outw(*data, port);
                 break;
             case 4:
-                outl(*val, port);
+                outl(*data, port);
                 break;
             }
         } else {
-            switch (len) {
+            switch (size) {
             case 1:
-                ret = inb(port);
+                val = inb(port);
                 break;
             case 2:
-                ret = inw(port);
+                val = inw(port);
                 break;
             case 4:
-                ret = inl(port);
+                val = inl(port);
                 break;
             }
-            DEBUG("in val=%x, len=%d, e_phys=%x, host=%x\n",
-                  ret, len, addr, port);
+            DEBUG("in data=%x, size=%d, e_phys=%x, host=%x\n",
+                  val, size, addr, port);
        }
     }
-    return ret;
-}
-
-static void assigned_dev_ioport_writeb(void *opaque, uint32_t addr,
-                                       uint32_t value)
-{
-    assigned_dev_ioport_rw(opaque, addr, 1, &value);
-    return;
-}
-
-static void assigned_dev_ioport_writew(void *opaque, uint32_t addr,
-                                       uint32_t value)
-{
-    assigned_dev_ioport_rw(opaque, addr, 2, &value);
-    return;
-}
-
-static void assigned_dev_ioport_writel(void *opaque, uint32_t addr,
-                                       uint32_t value)
-{
-    assigned_dev_ioport_rw(opaque, addr, 4, &value);
-    return;
-}
-
-static uint32_t assigned_dev_ioport_readb(void *opaque, uint32_t addr)
-{
-    return assigned_dev_ioport_rw(opaque, addr, 1, NULL);
+    return val;
 }
 
-static uint32_t assigned_dev_ioport_readw(void *opaque, uint32_t addr)
+static void assigned_dev_ioport_write(void *opaque, target_phys_addr_t addr,
+                                      uint64_t data, unsigned size)
 {
-    return assigned_dev_ioport_rw(opaque, addr, 2, NULL);
+    assigned_dev_ioport_rw(opaque, addr, size, &data);
 }
 
-static uint32_t assigned_dev_ioport_readl(void *opaque, uint32_t addr)
+static uint64_t 

[PATCH 3/6] pci-assign: Fix PCI_EXP_FLAGS_TYPE shift

2011-11-16 Thread Alex Williamson
Coverity found that we're doing ((uint16_t)type & 0xf0) >> 8.
This is obviously always 0x0, so our attempt to filter out
some device types thinks everything is an endpoint.  Fix
shift amount.
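The bug is easy to reproduce in isolation. A minimal sketch (the mask value matches PCI_EXP_FLAGS_TYPE, bits 7:4 of the PCIe Flags register; the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_EXP_FLAGS_TYPE 0x00f0  /* device/port type, bits 7:4 */

/* Buggy extraction: the masked value is at most 0xf0, so shifting
 * right by 8 always yields 0 — every device looks like type 0. */
static uint16_t pcie_type_buggy(uint16_t flags)
{
    return (flags & PCI_EXP_FLAGS_TYPE) >> 8;
}

/* Fixed extraction: shift by the field's bit offset, 4. */
static uint16_t pcie_type_fixed(uint16_t flags)
{
    return (flags & PCI_EXP_FLAGS_TYPE) >> 4;
}
```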

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 571a097..ec302d2 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1294,7 +1294,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
         assigned_dev_setup_cap_read(dev, pos, size);
 
         type = pci_get_word(pci_dev->config + pos + PCI_EXP_FLAGS);
-        type = (type & PCI_EXP_FLAGS_TYPE) >> 8;
+        type = (type & PCI_EXP_FLAGS_TYPE) >> 4;
         if (type != PCI_EXP_TYPE_ENDPOINT &&
             type != PCI_EXP_TYPE_LEG_END && type != PCI_EXP_TYPE_RC_END) {
             fprintf(stderr,



[PATCH 4/6] pci-assign: Fix PCIe lnkcap

2011-11-16 Thread Alex Williamson
Another Coverity-found issue: lnkcap is a 32-bit register and
we're masking bits 16 & 17.  Fix to uint32_t.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index ec302d2..dd92ce0 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1240,8 +1240,8 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
 
     if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
         uint8_t version, size = 0;
-        uint16_t type, devctl, lnkcap, lnksta;
-        uint32_t devcap;
+        uint16_t type, devctl, lnksta;
+        uint32_t devcap, lnkcap;
 
         version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
         version &= PCI_EXP_FLAGS_VERS;
@@ -1326,11 +1326,11 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
     pci_set_word(pci_dev->config + pos + PCI_EXP_DEVSTA, 0);
 
     /* Link capabilities, expose links and latencues, clear reporting */
-    lnkcap = pci_get_word(pci_dev->config + pos + PCI_EXP_LNKCAP);
+    lnkcap = pci_get_long(pci_dev->config + pos + PCI_EXP_LNKCAP);
     lnkcap &= (PCI_EXP_LNKCAP_SLS | PCI_EXP_LNKCAP_MLW |
                PCI_EXP_LNKCAP_ASPMS | PCI_EXP_LNKCAP_L0SEL |
                PCI_EXP_LNKCAP_L1EL);
-    pci_set_word(pci_dev->config + pos + PCI_EXP_LNKCAP, lnkcap);
+    pci_set_long(pci_dev->config + pos + PCI_EXP_LNKCAP, lnkcap);
     pci_set_word(pci_dev->wmask + pos + PCI_EXP_LNKCAP,
                  PCI_EXP_LNKCTL_ASPMC | PCI_EXP_LNKCTL_RCB |
                  PCI_EXP_LNKCTL_CCC | PCI_EXP_LNKCTL_ES |



[PATCH 5/6] pci-assign: Remove bogus PCIe lnkcap wmask setting

2011-11-16 Thread Alex Williamson
All the fields of lnkcap are read-only and this is setting it
with mask values from LNKCTL.  Just below it, we indicate
link control is read only, so this appears to be a stray
chunk left in from development.  Trivial comment fix while
we're here.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |6 +-
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index dd92ce0..0160de7 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1312,7 +1312,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
     pci_set_long(pci_dev->config + pos + PCI_EXP_DEVCAP, devcap);
 
     /* device control: clear all error reporting enable bits, leaving
-     * leaving only a few host values.  Note, these are
+     * only a few host values.  Note, these are
      * all writable, but not passed to hw.
      */
     devctl = pci_get_word(pci_dev->config + pos + PCI_EXP_DEVCTL);
@@ -1331,10 +1331,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
               PCI_EXP_LNKCAP_ASPMS | PCI_EXP_LNKCAP_L0SEL |
               PCI_EXP_LNKCAP_L1EL);
     pci_set_long(pci_dev->config + pos + PCI_EXP_LNKCAP, lnkcap);
-    pci_set_word(pci_dev->wmask + pos + PCI_EXP_LNKCAP,
-                 PCI_EXP_LNKCTL_ASPMC | PCI_EXP_LNKCTL_RCB |
-                 PCI_EXP_LNKCTL_CCC | PCI_EXP_LNKCTL_ES |
-                 PCI_EXP_LNKCTL_CLKREQ_EN | PCI_EXP_LNKCTL_HAWD);
 
     /* Link control, pass existing read-only copy.  Should be writable? */
 



[PATCH 6/6] pci-assign: Harden I/O port test

2011-11-16 Thread Alex Williamson
Markus Armbruster points out that we're missing a  0 check
from pread while trying to probe for pci-sysfs io-port
resource support.  We don't expect a short read, but we
should harden the test to abort if we get one so we're not
potentially looking at a stale errno.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 0160de7..7e6f972 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -434,8 +434,9 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
          * kernels return EIO.  New kernels only allow 1/2/4 byte reads
          * so should return EINVAL for a 3 byte read */
         ret = pread(pci_dev->v_addrs[i].region->resource_fd, &val, 3, 0);
-        if (ret == 3) {
-            fprintf(stderr, "I/O port resource supports 3 byte read?!\n");
+        if (ret >= 0) {
+            fprintf(stderr, "Unexpected return from I/O port read: %d\n",
+                    ret);
             abort();
         } else if (errno != EINVAL) {
             fprintf(stderr, "Using raw in/out ioport access (sysfs - %s)\n",



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-16 Thread Alex Williamson
On Tue, 2011-11-15 at 16:29 -0600, Scott Wood wrote:
 On 11/15/2011 03:40 PM, Aaron Fabbri wrote:
  
  
  
  On 11/15/11 12:10 PM, Scott Wood scottw...@freescale.com wrote:
  
  On 11/15/2011 12:34 AM, David Gibson wrote:
  snip 
  +static int allow_unsafe_intrs;
  +module_param(allow_unsafe_intrs, int, 0);
   +MODULE_PARM_DESC(allow_unsafe_intrs,
   +	"Allow use of IOMMUs which do not support interrupt remapping");
 
  This should not be a global option, but part of the AMD/Intel IOMMU
  specific code.  In general it's a question of how strict the IOMMU
  driver is about isolation when it determines what the groups are, and
  only the IOMMU driver can know what the possibilities are for its
  class of hardware.
 
  It's also a concern that is specific to MSIs.  In any case, I'm not sure
  that the ability to cause a spurious IRQ is bad enough to warrant
  disabling the entire subsystem by default on certain hardware.
  
  I think the issue is more that the ability to create fake MSI interrupts can
  lead to bigger exploits.
  
  Originally we didn't have this parameter. It was added it to reflect the
  fact that MSI's triggered by guests are dangerous without the isolation that
  interrupt remapping provides.
  
  That is, it *should* be inconvenient to run without interrupt mapping HW
  support.
 
 A sysfs knob is sufficient inconvenience.  It should only affect whether
 you can use MSIs, and the relevant issue shouldn't be "has interrupt
 remapping" but "is there a hole".
 
 Some systems might address the issue in ways other than IOMMU-level MSI
 translation.  Our interrupt controller provides enough separate 4K pages
 for MSI interrupt delivery for each PCIe IOMMU group to get its own (we
 currently only have 3, one per root complex) -- no special IOMMU feature
 required.
 
 It doesn't help that the semantics of IOMMU_CAP_INTR_REMAP are
 undefined.  I shouldn't have to know how x86 IOMMUs work when
 implementing a driver for different hardware, just to know what the
 generic code is expecting.
 
 As David suggests, if you want to do this it should be the x86 IOMMU
 driver that has a knob that controls how it forms groups in the absence
 of this support.

That is a possibility, we could push it down to the iommu driver which
could simply lump everything into a single groupid when interrupt
remapping is not supported.  Or more directly, when there is an exposure
that devices can trigger random MSIs in the host.  Then we wouldn't need
an option to override this in vfio, you'd just be stuck not being able
to use any devices if you can't bind everything to vfio.  That also
eliminates the possibility of flipping it on dynamically since we can't
handle groupids changing.  Then we'd need an iommu=group_unsafe_msi flag
to enable it.  Ok?  Thanks,

Alex



[RFC PATCH 11/11] KVM: PPC: Eliminate global spinlock in kvmppc_h_enter

2011-11-16 Thread Paul Mackerras
From dfd5bcfac841f8a36593edf60d9fb15e0d633287 Mon Sep 17 00:00:00 2001
From: Paul Mackerras pau...@samba.org
Date: Mon, 14 Nov 2011 13:30:38 +1100
Subject: 

Currently, kvmppc_h_enter takes a spinlock that is global to the guest,
kvm->mmu_lock, in order to check for pending PTE invalidations safely.
On some workloads, kvmppc_h_enter is called heavily and the use of a
global spinlock could compromise scalability.  We already use a per-
guest page spinlock in the form of the bit spinlock on the rmap chain,
and this gives us synchronization with the PTE invalidation side, which
also takes the bit spinlock on the rmap chain for each page being
invalidated.  Thus it is sufficient to check for pending invalidations
while the rmap chain bit spinlock is held.  However, now we require
barriers in mmu_notifier_retry() and in the places where
mmu_notifier_count and mmu_notifier_seq are updated, since we can now
call mmu_notifier_retry() concurrently with updates to those fields.

Signed-off-by: Paul Mackerras pau...@samba.org
---
Cc'd to kvm@vger.kernel.org for review of the generic kvm changes.

 arch/powerpc/include/asm/kvm_book3s_64.h |   13 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |   19 
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |   75 -
 include/linux/kvm_host.h |   13 +++--
 virt/kvm/kvm_main.c  |4 ++
 5 files changed, 66 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 3745337..db6cbd5 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -161,4 +161,17 @@ static inline unsigned long 
kvmppc_read_update_linux_pte(pte_t *p)
return pfn;
 }
 
+static inline void lock_rmap(unsigned long *rmap)
+{
+   do {
+   while (test_bit(KVMPPC_RMAP_LOCK_BIT, rmap))
+   cpu_relax();
+   } while (test_and_set_bit_lock(KVMPPC_RMAP_LOCK_BIT, rmap));
+}
+
+static inline void unlock_rmap(unsigned long *rmap)
+{
+   __clear_bit_unlock(KVMPPC_RMAP_LOCK_BIT, rmap);
+}
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8c497b8..bb75bfb 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -611,12 +611,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		goto out_put;
 	pfn = page_to_pfn(page);
 
-	/* Check if we might have been invalidated; let the guest retry if so */
-	ret = RESUME_GUEST;
-	spin_lock(&kvm->mmu_lock);
-	if (mmu_notifier_retry(vcpu, mmu_seq))
-		goto out_unlock;
-
 	/* Set the HPTE to point to pfn */
 	ret = RESUME_GUEST;
 	hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
@@ -627,19 +621,26 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	    rev->guest_rpte != hpte[2]) {
 		/* HPTE has been changed under us; let the guest retry */
 		hptep[0] &= ~HPTE_V_HVLOCK;
-		goto out_unlock;
+		goto out_put;
 	}
 	hpte[0] = (hpte[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;
 	hpte[1] = (rev->guest_rpte & ~(HPTE_R_PP0 - pte_size)) |
 		(pfn << PAGE_SHIFT);
 	rmap = &memslot->rmap[gfn - memslot->base_gfn];
+	lock_rmap(rmap);
+
+	/* Check if we might have been invalidated; let the guest retry if so */
+	ret = RESUME_GUEST;
+	if (mmu_notifier_retry(vcpu, mmu_seq)) {
+		unlock_rmap(rmap);
+		hptep[0] &= ~HPTE_V_HVLOCK;
+		goto out_put;
+	}
 	kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
 	kvmppc_modify_hpte(kvm, hptep, hpte, index);
 	if (page)
 		SetPageDirty(page);
 
- out_unlock:
-	spin_unlock(&kvm->mmu_lock);
  out_put:
 	if (page)
 		put_page(page);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 2cadd06..4070920 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -57,22 +57,16 @@ static struct kvm_memory_slot *builtin_gfn_to_memslot(struct kvm *kvm,
 	return NULL;
 }
 
-static void lock_rmap(unsigned long *rmap)
-{
-	do {
-		while (test_bit(KVMPPC_RMAP_LOCK_BIT, rmap))
-			cpu_relax();
-	} while (test_and_set_bit_lock(KVMPPC_RMAP_LOCK_BIT, rmap));
-}
-
-/* Add this HPTE into the chain for the real page */
+/*
+ * Add this HPTE into the chain for the real page.
+ * Must be called with the chain locked; it unlocks the chain.
+ */
 void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
 			     unsigned long *rmap, long pte_index, int realmode)
 {
 	struct revmap_entry *head, *tail;
 	unsigned long i;
 
-	

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-16 Thread David Gibson
On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
 On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
  On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
   diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
   new file mode 100644
   index 000..5866896
   --- /dev/null
   +++ b/Documentation/vfio.txt
   @@ -0,0 +1,304 @@
   +VFIO - Virtual Function I/O[1]
   +---
    +Many modern systems now provide DMA and interrupt remapping facilities
   +to help ensure I/O devices behave within the boundaries they've been
   +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
   +well as POWER systems with Partitionable Endpoints (PEs) and even
   +embedded powerpc systems (technology name unknown).  The VFIO driver
   +is an IOMMU/device agnostic framework for exposing direct device
   +access to userspace, in a secure, IOMMU protected environment.  In
   +other words, this allows safe, non-privileged, userspace drivers.
  
  It's perhaps worth emphasisng that safe depends on the hardware
  being sufficiently well behaved.  BenH, I know, thinks there are a
  *lot* of cards that, e.g. have debug registers that allow a backdoor
  to their own config space via MMIO, which would bypass vfio's
  filtering of config space access.  And that's before we even get into
  the varying degrees of completeness in the isolation provided by
  different IOMMUs.
 
 Fair enough.  I know Tom had emphasized well behaved in the original
 doc.  Virtual functions are probably the best indicator of well behaved.
 
   +Why do we want that?  Virtual machines often make use of direct device
   +access (device assignment) when configured for the highest possible
   +I/O performance.  From a device and host perspective, this simply turns
   +the VM into a userspace driver, with the benefits of significantly
   +reduced latency, higher bandwidth, and direct use of bare-metal device
   +drivers[2].
   +
   +Some applications, particularly in the high performance computing
   +field, also benefit from low-overhead, direct device access from
   +userspace.  Examples include network adapters (often non-TCP/IP based)
   +and compute accelerators.  Previous to VFIO, these drivers needed to
  
  s/Previous/Prior/  although that may be a .us vs .au usage thing.
 
 Same difference, AFAICT.
 
   +go through the full development cycle to become proper upstream driver,
   +be maintained out of tree, or make use of the UIO framework, which
   +has no notion of IOMMU protection, limited interrupt support, and
   +requires root privileges to access things like PCI configuration space.
   +
   +The VFIO driver framework intends to unify these, replacing both the
   +KVM PCI specific device assignment currently used as well as provide
   +a more secure, more featureful userspace driver environment than UIO.
   +
   +Groups, Devices, IOMMUs, oh my
   +---
   +
   +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
   +can't always distinguish transactions from each individual device in
   +the system.  Sometimes this is because of the IOMMU design, such as with
   +PEs, other times it's caused by the I/O topology, for instance a
   +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
  +devices created by these restrictions IOMMU groups (or just groups for
   +this document).
   +
  +The IOMMU cannot distinguish transactions between the individual devices
   +within the group, therefore the group is the basic unit of ownership for
   +a userspace process.  Because of this, groups are also the primary
   +interface to both devices and IOMMU domains in VFIO.
   +
   +The VFIO representation of groups is created as devices are added into
   +the framework by a VFIO bus driver.  The vfio-pci module is an example
   +of a bus driver.  This module registers devices along with a set of bus
   +specific callbacks with the VFIO core.  These callbacks provide the
   +interfaces later used for device access.  As each new group is created,
   +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
   +character device.
  
  Ok.. so, the fact that it's called vfio-pci suggests that the VFIO
  bus driver is per bus type, not per bus instance.   But grouping
  constraints could be per bus instance, if you have a couple of
  different models of PCI host bridge with IOMMUs of different
  capabilities built in, for example.
 
 Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
 instance.

Ok, how can that work?  vfio-pci is responsible for generating the
groupings, yes?  For which it needs to know the iommu/host bridge's
isolation capabilities, which vary depending on the type of host
bridge.

  IOMMUs also register drivers per bus type, not per bus
 instance.  The IOMMU driver is free to impose 

kvm-tools: can't seem to set guest_mac and KVM_GET_SUPPORTED_CPUID failed.

2011-11-16 Thread David Evensky


There was a patch (quoted below) that changed networking at the end of
September. When I try to set the guest_mac following the usage in the
patch, and after an admittedly too-brief look at the code, the guest's
mac address isn't being set. I'm using:

sudo /path/to/linux-kvm/tools/kvm/kvm run -c 1 -m 256 -k /path/to/bzImage-3.0.8 
\
   -i /path/to/initramfs-host.img --console serial -p ' console=ttyS0  ' -n 
tap,guest_mac=00:11:11:11:11:11

In the guest I get:

# ifconfig eth0
eth0  Link encap:Ethernet  HWaddr 02:15:15:15:15:15  
  inet addr:192.168.122.237  Bcast:192.168.122.255  Mask:255.255.255.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:24 errors:0 dropped:2 overruns:0 frame:0
  TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000 
  RX bytes:1874 (1.8 KiB)  TX bytes:656 (656.0 B)

which is the default.

Also, when I start the guest I sometimes get the following error message:

  # kvm run -k /path/to/bzImage-3.0.8 -m 256 -c 1 --name guest-15757
KVM_GET_SUPPORTED_CPUID failed: Argument list too long

I haven't seen that before.

Thanks,
\dae

On Sat, Sep 24, 2011 at 12:17:51PM +0300, Sasha Levin wrote:
 This patch adds support for multiple network devices. The command line syntax
 changes to the following:
 
   --network/-n [mode=[tap/user/none]] [guest_ip=[guest ip]] [host_ip=
 [host_ip]] [guest_mac=[guest_mac]] [script=[script]]
 
 Each of the parameters is optional, and the config defaults to a TAP based
 networking with a random MAC.
 ...

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm-tools: can't seem to set guest_mac and KVM_GET_SUPPORTED_CPUID failed.

2011-11-16 Thread Sasha Levin
On Wed, 2011-11-16 at 16:42 -0800, David Evensky wrote:
 
 There was a patch (quoted below) that changed networking at the end of
 September. When I try to set the guest_mac following the usage in the
 patch, and after an admittedly too-brief look at the code, the guest's
 mac address isn't being set. I'm using:
 
 sudo /path/to/linux-kvm/tools/kvm/kvm run -c 1 -m 256 -k 
 /path/to/bzImage-3.0.8 \
-i /path/to/initramfs-host.img --console serial -p ' console=ttyS0  ' -n 
 tap,guest_mac=00:11:11:11:11:11
 
 In the guest I get:
 
 # ifconfig eth0
 eth0  Link encap:Ethernet  HWaddr 02:15:15:15:15:15  
   inet addr:192.168.122.237  Bcast:192.168.122.255  Mask:255.255.255.0
   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
   RX packets:24 errors:0 dropped:2 overruns:0 frame:0
   TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 txqueuelen:1000 
   RX bytes:1874 (1.8 KiB)  TX bytes:656 (656.0 B)
 
 which is the default.

This should be '-n mode=tap,guest_mac=00:11:11:11:11:11'

It will set the right mac:

sh-2.05b# ifconfig
eth0  Link encap:Ethernet  HWaddr 00:11:11:11:11:11
[...]

 
 Also, when I start the guest I sometimes get the following error message:
 
   # kvm run -k /path/to/bzImage-3.0.8 -m 256 -c 1 --name guest-15757
 KVM_GET_SUPPORTED_CPUID failed: Argument list too long

Heh, we were talking about this a couple of weeks ago, but since I couldn't
reproduce it here (it was happening to me before, but now it's gone) the
discussion died.

Could you please provide some statistics on how often it happens to you?
Also, can you try wrapping the ioctl with a 'while (1)' (there's only one
ioctl call to KVM_GET_SUPPORTED_CPUID) and see if it would happen at
some point?

Thanks!

 I haven't seen that before.
 
 Thanks,
 \dae
 
 On Sat, Sep 24, 2011 at 12:17:51PM +0300, Sasha Levin wrote:
  This patch adds support for multiple network devices. The command line 
  syntax
  changes to the following:
  
  --network/-n [mode=[tap/user/none]] [guest_ip=[guest ip]] [host_ip=
  [host_ip]] [guest_mac=[guest_mac]] [script=[script]]
  
  Each of the parameters is optional, and the config defaults to a TAP based
  networking with a random MAC.
  ...
 

-- 

Sasha.



Re: kvm-tools: can't seem to set guest_mac and KVM_GET_SUPPORTED_CPUID failed.

2011-11-16 Thread Pekka Enberg
On Thu, Nov 17, 2011 at 8:07 AM, Sasha Levin levinsasha...@gmail.com wrote:
 Also, when I start the guest I sometimes get the following error message:

   # kvm run -k /path/to/bzImage-3.0.8 -m 256 -c 1 --name guest-15757
 KVM_GET_SUPPORTED_CPUID failed: Argument list too long

 Heh, we were talking about this a couple of weeks ago, but since I couldn't
 reproduce it here (it was happening to me before, but now it's gone) the
 discussion died.

 Could you please provide some statistics on how often it happens to you?
 Also, can you try wrapping the ioctl with a 'while (1)' (there's only one
 ioctl call to KVM_GET_SUPPORTED_CPUID) and see if it would happen at
 some point?

I'm no longer able to reproduce it here with 3.2-rc1. We could just
try the easy way out and do what Qemu does and retry for E2BIG...


Re: kvm-tools: can't seem to set guest_mac and KVM_GET_SUPPORTED_CPUID failed.

2011-11-16 Thread Sasha Levin
On Thu, 2011-11-17 at 08:53 +0200, Pekka Enberg wrote:
 On Thu, Nov 17, 2011 at 8:07 AM, Sasha Levin levinsasha...@gmail.com wrote:
  Also, when I start the guest I sometimes get the following error message:
 
# kvm run -k /path/to/bzImage-3.0.8 -m 256 -c 1 --name guest-15757
  KVM_GET_SUPPORTED_CPUID failed: Argument list too long
 
  Heh, we were talking about this a couple of weeks ago, but since I couldn't
  reproduce it here (it was happening to me before, but now it's gone) the
  discussion died.
 
  Could you please provide some statistics on how often it happens to you?
  Also, can you try wrapping the ioctl with a 'while (1)' (there's only one
  ioctl call to KVM_GET_SUPPORTED_CPUID) and see if it would happen at
  some point?
 
 I'm no longer able to reproduce it here with 3.2-rc1. We could just
 try the easy way out and do what Qemu does and retry for E2BIG...

Let's not do that :)

It'll just get uncovered again when someone decides to use
KVM_GET_SUPPORTED_CPUID somewhere else (like in Avi's cpuid patch).

I'll try going back to 3.0 later today and see if it comes back.

David, which host kernel do you use?

-- 

Sasha.



[RFC PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-16 Thread Lan Tianyu
When a write request targets a cluster that does not have the copied
flag set, allocate a new cluster and write the original data, with the
modification applied, to the new cluster. This also adds support for
write operations on qcow2 compressed images.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  322 --
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 218 insertions(+), 106 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..2b9af73 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -122,9 +122,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 	 */
 	lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-	if (qcow_l2_cache_write(q, lru) < 0)
-		goto error;
-
 	/* Remove the node from the cache */
 	rb_erase(&lru->node, r);
 	list_del_init(&lru->list);
@@ -728,35 +725,110 @@ error_free_rfb:
return NULL;
 }
 
-/*
- * QCOW file might grow during a write operation. Not only data but metadata is
- * also written at the end of the file. Therefore it is necessary to ensure
- * every write is committed to disk. Hence we use qcow_pwrite_sync() to
- * synchronize the in-core state of the QCOW image to disk.
- *
- * We also try to restore the image to a consistent state if a metadata
- * operation fails. The two metadata operations are: level 1 and level 2 table
- * update. If either of them fails the image is truncated to a consistent
- * state.
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16 refcount;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+	if (rfb_idx >= rfb->size) {
+		pr_warning("refcount block index out of bounds");
+		return -1;
+	}
+
+	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
+	rfb->entries[rfb_idx] = cpu_to_be16(refcount);
+	rfb->dirty = 1;
+
+	/* write the refcount block back */
+	write_refcount_block(q, rfb);
+
+	/* update free_clust_idx since refcount became zero */
+	if (!refcount && clust_idx < q->free_clust_idx)
+		q->free_clust_idx = clust_idx;
+
+	return 0;
+}
+
+/*
+ * Allocate clusters according to the size.  Find a position that
+ * can satisfy the size.  free_clust_idx is initialized to zero and
+ * records the last position searched.
+ */
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+{
+	struct qcow_header *header = q->header;
+	u16 clust_refcount;
+	u32 clust_idx, i;
+	u64 clust_num;
+
+	clust_num = (size + (q->cluster_size - 1)) >> header->cluster_bits;
+
+again:
+	for (i = 0; i < clust_num; i++) {
+		clust_idx = q->free_clust_idx++;
+		clust_refcount = qcow_get_refcount(q, clust_idx);
+		if (clust_refcount < 0)
+			return -1;
+		else if (clust_refcount > 0)
+			goto again;
+	}
+
+	for (i = 0; i < clust_num; i++)
+		update_cluster_refcount(q,
+			q->free_clust_idx - clust_num + i, 1);
+
+	return (q->free_clust_idx - clust_num) << header->cluster_bits;
+}
+
+/* Get the L2 table.  If the table has already been copied, read it
+ * directly.  Otherwise allocate a new cluster and copy the table
+ * to the new cluster.
  */
-static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src_len)
+static int get_cluster_table(struct qcow *q, u64 offset,
+	struct qcow_l2_table **result_l2t, u64 *result_l2_index)
 {
 	struct qcow_header *header = q->header;
 	struct qcow_l1_table *l1t = &q->table;
 	struct qcow_l2_table *l2t;
-	u64 clust_start;
-	u64 clust_flags;
-	u64 l2t_offset;
-	u64 clust_off;
-	u64 l2t_size;
-	u64 clust_sz;
 	u64 l1t_idx;
+	u64 l2t_offset;
   

[RFC PATCH 10/11] KVM: PPC: Implement MMU notifiers

2011-11-16 Thread Paul Mackerras
This implements the low-level functions called by the MMU notifiers in
the generic KVM code, and defines KVM_ARCH_WANT_MMU_NOTIFIER if
CONFIG_KVM_BOOK3S_64_HV so that the generic KVM MMU notifiers get
included.

That means we also have to take notice of when PTE invalidations are
in progress, as indicated by mmu_notifier_retry().  In kvmppc_h_enter,
if any invalidation is in progress we just install a non-present HPTE.
In kvmppc_book3s_hv_page_fault, if an invalidation is in progress we
just return without resolving the fault, causing the guest to encounter another
page fault immediately.  This is better than spinning inside
kvmppc_book3s_hv_page_fault because this way the guest can get preempted
by a hypervisor decrementer interrupt without us having to do any
special checks.

We currently maintain a referenced bit in the rmap array, and when we
clear it, we make all the HPTEs that map the corresponding page be
non-present, as if the page were invalidated.  In future we could use
the hardware reference bit in the guest HPT instead.

The kvm_set_spte_hva function is implemented as kvm_unmap_hva.  The
former appears to be unused anyway.

This all means that on processors that support virtual partition
memory (POWER7), we can claim support for the KVM_CAP_SYNC_MMU
capability, and we no longer have to pin all the guest memory.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |   13 +++
 arch/powerpc/kvm/Kconfig|1 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  160 ++-
 arch/powerpc/kvm/book3s_hv.c|   25 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   34 ++-
 arch/powerpc/kvm/powerpc.c  |3 +
 6 files changed, 218 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 3dfac3d..79bfc69 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -44,6 +44,19 @@
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 #endif
 
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+#include <linux/mmu_notifier.h>
+
+#define KVM_ARCH_WANT_MMU_NOTIFIER
+
+struct kvm;
+extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
+extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
+
+#endif
+
 /* We don't currently support large pages. */
 #define KVM_HPAGE_GFN_SHIFT(x) 0
 #define KVM_NR_PAGE_SIZES  1
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 78133de..8f64709 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -69,6 +69,7 @@ config KVM_BOOK3S_64
 config KVM_BOOK3S_64_HV
 	bool "KVM support for POWER7 and PPC970 using hypervisor mode in host"
depends on KVM_BOOK3S_64
+   select MMU_NOTIFIER
---help---
  Support running unmodified book3s_64 guest kernels in
  virtual machines on POWER7 and PPC970 processors that have
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index e93c789..8c497b8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -138,6 +138,15 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 	hp1 = hpte1_pgsize_encoding(psize) |
 		HPTE_R_R | HPTE_R_C | HPTE_R_M | PP_RWXX;
 
+	spin_lock(&kvm->mmu_lock);
+	/* wait until no invalidations are in progress */
+	while (kvm->mmu_notifier_count) {
+		spin_unlock(&kvm->mmu_lock);
+		while (kvm->mmu_notifier_count)
+			cpu_relax();
+		spin_lock(&kvm->mmu_lock);
+	}
+
 	for (i = 0; i < npages; ++i) {
 		addr = i << porder;
 		if (pfns) {
@@ -185,6 +194,7 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 				KVMPPC_RMAP_REFERENCED | KVMPPC_RMAP_PRESENT;
 		}
 	}
+	spin_unlock(&kvm->mmu_lock);
 }
 
 int kvmppc_mmu_hv_init(void)
@@ -506,7 +516,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_slb *slbe;
 	unsigned long *hptep, hpte[3];
-	unsigned long psize, pte_size;
+	unsigned long mmu_seq, psize, pte_size;
 	unsigned long gfn, hva, pfn, amr;
 	struct kvm_memory_slot *memslot;
 	unsigned long *rmap;
@@ -581,6 +591,11 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	if (kvm->arch.slot_pfns[memslot->id])
 		return -EFAULT;	/* should never get here */
 	hva = gfn_to_hva_memslot(memslot, gfn);
+
+	/* used to check for invalidations in progress */
+	mmu_seq = kvm->mmu_notifier_seq;
+	smp_rmb();
+
npages = get_user_pages_fast(hva, 

[RFC PATCH 07/11] KVM: PPC: Convert do_h_register_vpa to use Linux page tables

2011-11-16 Thread Paul Mackerras
This makes do_h_register_vpa use a new helper function,
kvmppc_pin_guest_page, to pin the page containing the virtual
processor area that the guest wants to register.  The logic of
whether to use the userspace Linux page tables or the slot_pfns
array is thus hidden in kvmppc_pin_guest_page.  There is also a
new kvmppc_unpin_guest_page to release a previously-pinned page,
which we call at VPA unregistration time, or when a new VPA is
registered, or when the vcpu is destroyed.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s.h |3 ++
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   44 +++
 arch/powerpc/kvm/book3s_hv.c  |   52 ++--
 3 files changed, 83 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bd8345f..b5ee1ce 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -141,6 +141,9 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct 
kvmppc_bat *bat,
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu 
*vcpu);
 extern pfn_t kvmppc_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+extern void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long addr,
+   unsigned long *nb_ret);
+extern void kvmppc_unpin_guest_page(struct kvm *kvm, void *addr);
 
 extern void kvmppc_entry_trampoline(void);
 extern void kvmppc_hv_entry_trampoline(void);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 99187db..9c7e825 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -480,6 +480,50 @@ int kvmppc_book3s_hv_emulate_mmio(struct kvm_run *run, 
struct kvm_vcpu *vcpu)
return kvmppc_emulate_mmio(run, vcpu);
 }
 
+void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long gpa,
+			unsigned long *nb_ret)
+{
+	struct kvm_memory_slot *memslot;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+	struct page *pages[1];
+	int npages;
+	unsigned long hva, psize, offset;
+	unsigned long pfn;
+	unsigned long *pfnp;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID) ||
+	    (memslot->flags & KVM_MEMSLOT_IO))
+		return NULL;
+	pfnp = kvmppc_pfn_entry(kvm, memslot, gfn);
+	if (pfnp) {
+		pfn = *pfnp;
+		if (!pfn)
+			return NULL;
+		psize = 1ul << kvm->arch.slot_page_order[memslot->id];
+		pages[0] = pfn_to_page(pfn);
+		get_page(pages[0]);
+	} else {
+		hva = gfn_to_hva_memslot(memslot, gfn);
+		npages = get_user_pages_fast(hva, 1, 1, pages);
+		if (npages < 1)
+			return NULL;
+		psize = PAGE_SIZE;
+	}
+	offset = gpa & (psize - 1);
+	if (nb_ret)
+		*nb_ret = psize - offset;
+	return page_address(pages[0]) + offset;
+}
+
+void kvmppc_unpin_guest_page(struct kvm *kvm, void *va)
+{
+	struct page *page = virt_to_page(va);
+
+	page = compound_head(page);
+	put_page(page);
+}
+
 void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_mmu *mmu = &vcpu->arch.mmu;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cb21845..ceb49d2 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -163,10 +163,10 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
 				       unsigned long vcpuid, unsigned long vpa)
 {
 	struct kvm *kvm = vcpu->kvm;
-	unsigned long ra, len;
-	unsigned long nb;
+	unsigned long len, nb;
 	void *va;
 	struct kvm_vcpu *tvcpu;
+	int err = H_PARAMETER;
 
 	tvcpu = kvmppc_find_vcpu(kvm, vcpuid);
 	if (!tvcpu)
@@ -179,40 +179,41 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
 	if (flags < 4) {
 		if (vpa & 0x7f)
 			return H_PARAMETER;
+		if (flags >= 2 && !tvcpu->arch.vpa)
+			return H_RESOURCE;
 		/* registering new area; convert logical addr to real */
-		ra = kvmppc_logical_to_real(kvm, vpa, &nb);
-		if (!ra)
+		va = kvmppc_pin_guest_page(kvm, vpa, &nb);
+		if (va == NULL)
 			return H_PARAMETER;
-		va = __va(ra);
 		if (flags <= 1)
 			len = *(unsigned short *)(va + 4);
 		else
 			len = *(unsigned int *)(va + 4);
 		if (len > nb)
-			return H_PARAMETER;
+			goto out_unpin;
 		switch 

[PATCH 02/11] KVM: PPC: Keep a record of HV guest view of hashed page table entries

2011-11-16 Thread Paul Mackerras
This adds an array that parallels the guest hashed page table (HPT),
that is, it has one entry per HPTE, used to store the guest's view
of the second doubleword of the corresponding HPTE.  The first
doubleword in the HPTE is the same as the guest's idea of it, so we
don't need to store a copy, but the second doubleword in the HPTE has
the real page number rather than the guest's logical page number.
This allows us to remove the back_translate() and reverse_xlate()
functions.

This reverse mapping array is vmalloc'd, meaning that to access it
in real mode we have to walk the kernel's page tables explicitly.
That is done by the new real_vmalloc_addr() function.  (In fact this
returns an address in the linear mapping, so the result is usable
both in real mode and in virtual mode.)

This also corrects a couple of bugs in kvmppc_mmu_get_pp_value().

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   20 +
 arch/powerpc/include/asm/kvm_host.h  |   10 ++
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  136 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |   95 +
 4 files changed, 147 insertions(+), 114 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 53692c2..63542dd 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -29,6 +29,14 @@ static inline struct kvmppc_book3s_shadow_vcpu *to_svcpu(struct kvm_vcpu *vcpu)
 
 #define SPAPR_TCE_SHIFT		12
 
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/* For now use fixed-size 16MB page table */
+#define HPT_ORDER	24
+#define HPT_NPTEG	(1ul << (HPT_ORDER - 7))	/* 128B per pteg */
+#define HPT_NPTE	(HPT_NPTEG << 3)		/* 8 PTEs per PTEG */
+#define HPT_HASH_MASK	(HPT_NPTEG - 1)
+#endif
+
 static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 unsigned long pte_index)
 {
@@ -86,4 +94,16 @@ static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
 	return old == 0;
 }
 
+static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
+{
+	/* only handle 4k, 64k and 16M pages for now */
+	if (!(h & HPTE_V_LARGE))
+		return 1ul << 12;	/* 4k page */
+	if ((l & 0xf000) == 0x1000 && cpu_has_feature(CPU_FTR_ARCH_206))
+		return 1ul << 16;	/* 64k page */
+	if ((l & 0xff000) == 0)
+		return 1ul << 24;	/* 16M page */
+	return 0;	/* error */
+}
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index f142a2d..56f7046 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -166,9 +166,19 @@ struct kvmppc_rma_info {
atomic_t use_count;
 };
 
+/*
+ * The reverse mapping array has one entry for each HPTE,
+ * which stores the guest's view of the second word of the HPTE
+ * (including the guest physical address of the mapping).
+ */
+struct revmap_entry {
+   unsigned long guest_rpte;
+};
+
 struct kvm_arch {
 #ifdef CONFIG_KVM_BOOK3S_64_HV
unsigned long hpt_virt;
+   struct revmap_entry *revmap;
unsigned long ram_npages;
unsigned long ram_psize;
unsigned long ram_porder;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index da8c2f4..2b9b8be 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -23,6 +23,7 @@
 #include <linux/gfp.h>
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
+#include <linux/vmalloc.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -33,11 +34,6 @@
 #include <asm/ppc-opcode.h>
 #include <asm/cputable.h>
 
-/* For now use fixed-size 16MB page table */
-#define HPT_ORDER	24
-#define HPT_NPTEG	(1ul << (HPT_ORDER - 7))	/* 128B per pteg */
-#define HPT_HASH_MASK	(HPT_NPTEG - 1)
-
 /* Pages in the VRMA are 16MB pages */
 #define VRMA_PAGE_ORDER	24
 #define VRMA_VSID	0x1ffUL	/* 1TB VSID reserved for VRMA */
@@ -51,7 +47,9 @@ long kvmppc_alloc_hpt(struct kvm *kvm)
 {
unsigned long hpt;
unsigned long lpid;
+   struct revmap_entry *rev;
 
+   /* Allocate guest's hashed page table */
hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|__GFP_NOWARN,
   HPT_ORDER - PAGE_SHIFT);
if (!hpt) {
@@ -60,12 +58,20 @@ long kvmppc_alloc_hpt(struct kvm *kvm)
}
 	kvm->arch.hpt_virt = hpt;
 
+   /* Allocate reverse map array */
+   rev = vmalloc(sizeof(struct revmap_entry) * HPT_NPTE);
+   if (!rev) {
+		pr_err("kvmppc_alloc_hpt: Couldn't alloc reverse map array\n");
+   goto out_freehpt;
+   }
+   

[RFC PATCH 08/11] KVM: PPC: Add a page fault handler function

2011-11-16 Thread Paul Mackerras
This adds a kvmppc_book3s_hv_page_fault function that is capable of
handling the fault we get if the guest tries to access a non-present
page (one that we have marked with storage key 31 and no-execute),
and either doing MMIO emulation, or making the page resident and
rewriting the guest HPTE to point to it, if it is RAM.

We now call this for hypervisor instruction storage interrupts, and
for hypervisor data storage interrupts instead of the emulate-MMIO
function.  It can now be called for real-mode accesses through the
VRMA as well as virtual-mode accesses.

In order to identify non-present HPTEs, we use a second software-use
bit in the first dword of the HPTE, called HPTE_V_ABSENT.  We can't
just look for storage key 31 because non-present HPTEs for the VRMA
have to be actually invalid, as the storage key mechanism doesn't
operate in real mode.  Using this bit also means that we don't have
to restrict the guest from using key 31 any more.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s.h|6 +-
 arch/powerpc/include/asm/kvm_book3s_64.h |   11 ++-
 arch/powerpc/include/asm/kvm_host.h  |   30 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  259 +++---
 arch/powerpc/kvm/book3s_hv.c |   54 --
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |  121 --
 6 files changed, 340 insertions(+), 141 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index b5ee1ce..ac48438 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -121,7 +121,9 @@ extern void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu 
*vcpu);
 extern int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *pte);
 extern int kvmppc_mmu_map_segment(struct kvm_vcpu *vcpu, ulong eaddr);
 extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
-extern int kvmppc_book3s_hv_emulate_mmio(struct kvm_run *run, struct kvm_vcpu 
*vcpu);
+extern int kvmppc_book3s_hv_page_fault(struct kvm_run *run,
+   struct kvm_vcpu *vcpu, unsigned long addr,
+   unsigned long status);
 
 extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache 
*pte);
 extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu);
@@ -141,6 +143,8 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct 
kvmppc_bat *bat,
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu 
*vcpu);
 extern pfn_t kvmppc_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+extern void kvmppc_modify_hpte(struct kvm *kvm, unsigned long *hptep,
+   unsigned long new_hpte[2], unsigned long pte_index);
 extern void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long addr,
unsigned long *nb_ret);
 extern void kvmppc_unpin_guest_page(struct kvm *kvm, void *addr);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 307e649..3745337 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -37,6 +37,8 @@ static inline struct kvmppc_book3s_shadow_vcpu 
*to_svcpu(struct kvm_vcpu *vcpu)
 #define HPT_HASH_MASK  (HPT_NPTEG - 1)
 #endif
 
+#define VRMA_VSID  0x1ffUL /* 1TB VSID reserved for VRMA */
+
 static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 unsigned long pte_index)
 {
@@ -72,9 +74,11 @@ static inline unsigned long compute_tlbie_rb(unsigned long 
v, unsigned long r,
 
 /*
  * We use a lock bit in HPTE dword 0 to synchronize updates and
- * accesses to each HPTE.
+ * accesses to each HPTE, and another bit to indicate non-present
+ * HPTEs.
  */
 #define HPTE_V_HVLOCK  0x40UL
+#define HPTE_V_ABSENT  0x20UL
 
 static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
 {
@@ -106,6 +110,11 @@ static inline unsigned long hpte_page_size(unsigned long 
h, unsigned long l)
return 0;   /* error */
 }
 
+static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
+{
+	return ((ptel & HPTE_R_RPN) & ~(psize - 1)) >> PAGE_SHIFT;
+}
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 static inline unsigned long *kvmppc_pfn_entry(struct kvm *kvm,
struct kvm_memory_slot *memslot, unsigned long gfn)
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index f211643..ababf17 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -162,6 +162,20 @@ struct kvmppc_rma_info {
atomic_t use_count;
 };
 
+struct kvmppc_slb {
+   u64 esid;
+   u64 vsid;
+   u64 orige;
+   u64 origv;
+   bool valid  : 1;
+   bool Ks : 1;
+   bool Kp : 1;
+   

[PATCH 03/11] KVM: PPC: Allow use of small pages to back guest memory

2011-11-16 Thread Paul Mackerras
From: Nishanth Aravamudan n...@us.ibm.com

This puts the page frame numbers for the memory backing the guest in
the slot-rmap array for each slot, rather than using the ram_pginfo
array.  Since the rmap array is vmalloc'd, we use real_vmalloc_addr()
to access it when we access it in real mode in kvmppc_h_enter().
The rmap array contains one PFN for each small page, even if the
backing memory is large pages.

This lets us get rid of the ram_pginfo array.

[pau...@samba.org - Cleaned up and reorganized a bit, abstracted out
HPTE page size encoding functions, added check that memory being
added in kvmppc_core_prepare_memory_region is all in one VMA.]

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |8 --
 arch/powerpc/kvm/book3s_64_mmu_hv.c |   47 +++
 arch/powerpc/kvm/book3s_hv.c|  153 +--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   90 ++--
 4 files changed, 151 insertions(+), 147 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 56f7046..52fd741 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -145,11 +145,6 @@ struct kvmppc_exit_timing {
};
 };
 
-struct kvmppc_pginfo {
-   unsigned long pfn;
-   atomic_t refcnt;
-};
-
 struct kvmppc_spapr_tce_table {
struct list_head list;
struct kvm *kvm;
@@ -179,17 +174,14 @@ struct kvm_arch {
 #ifdef CONFIG_KVM_BOOK3S_64_HV
unsigned long hpt_virt;
struct revmap_entry *revmap;
-   unsigned long ram_npages;
unsigned long ram_psize;
unsigned long ram_porder;
-   struct kvmppc_pginfo *ram_pginfo;
unsigned int lpid;
unsigned int host_lpid;
unsigned long host_lpcr;
unsigned long sdr1;
unsigned long host_sdr1;
int tlbie_lock;
-   int n_rma_pages;
unsigned long lpcr;
unsigned long rmor;
struct kvmppc_rma_info *rma;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2b9b8be..bed6c61 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -34,8 +34,6 @@
 #include <asm/ppc-opcode.h>
 #include <asm/cputable.h>
 
-/* Pages in the VRMA are 16MB pages */
-#define VRMA_PAGE_ORDER24
 #define VRMA_VSID  0x1ffUL /* 1TB VSID reserved for VRMA */
 
 /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
@@ -95,19 +93,33 @@ void kvmppc_free_hpt(struct kvm *kvm)
	free_pages(kvm->arch.hpt_virt, HPT_ORDER - PAGE_SHIFT);
 }
 
+/* Bits in first HPTE dword for pagesize 4k, 64k or 16M */
+static inline unsigned long hpte0_pgsize_encoding(unsigned long pgsize)
+{
+   return (pgsize > 0x1000) ? HPTE_V_LARGE : 0;
+}
+
+/* Bits in second HPTE dword for pagesize 4k, 64k or 16M */
+static inline unsigned long hpte1_pgsize_encoding(unsigned long pgsize)
+{
+   return (pgsize == 0x10000) ? 0x1000 : 0;
+}
+
 void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 {
unsigned long i;
-   unsigned long npages = kvm->arch.ram_npages;
+   unsigned long npages;
unsigned long pfn;
unsigned long *hpte;
-   unsigned long hash;
+   unsigned long addr, hash;
+   unsigned long psize = kvm->arch.ram_psize;
	unsigned long porder = kvm->arch.ram_porder;
	struct revmap_entry *rev;
-   struct kvmppc_pginfo *pginfo = kvm->arch.ram_pginfo;
+   struct kvm_memory_slot *memslot;
+   unsigned long hp0, hp1;
 
-   if (!pginfo)
-   return;
+   memslot = kvm->memslots->memslots[mem->slot];
+   npages = memslot->npages >> (porder - PAGE_SHIFT);
 
	/* VRMA can't be > 1TB */
	if (npages > 1ul << (40 - porder))
@@ -116,10 +128,16 @@ void kvmppc_map_vrma(struct kvm *kvm, struct 
kvm_userspace_memory_region *mem)
	if (npages > HPT_NPTEG)
npages = HPT_NPTEG;
 
+   hp0 = HPTE_V_1TB_SEG | (VRMA_VSID << (40 - 16)) |
+   HPTE_V_BOLTED | hpte0_pgsize_encoding(psize) | HPTE_V_VALID;
+   hp1 = hpte1_pgsize_encoding(psize) |
+   HPTE_R_R | HPTE_R_C | HPTE_R_M | PP_RWXX;
+
	for (i = 0; i < npages; ++i) {
-   pfn = pginfo[i].pfn;
+   pfn = memslot->rmap[i << (porder - PAGE_SHIFT)];
	if (!pfn)
-   break;
+   continue;
+   addr = i << porder;
	/* can't use hpt_hash since va > 64 bits */
	hash = (i ^ (VRMA_VSID ^ (VRMA_VSID << 25))) & HPT_HASH_MASK;
/*
@@ -131,17 +149,14 @@ void kvmppc_map_vrma(struct kvm *kvm, struct 
kvm_userspace_memory_region *mem)
	hash = (hash << 3) + 7;
	hpte = (unsigned long *) (kvm->arch.hpt_virt + (hash << 4));
	/* HPTE low word - RPN, protection, etc. */
-   hpte[1] = (pfn << PAGE_SHIFT) | HPTE_R_R | HPTE_R_C |
-

[RFC PATCH 06/11] KVM: PPC: Use Linux page tables in h_enter and map_vrma

2011-11-16 Thread Paul Mackerras
This changes kvmppc_h_enter() and kvmppc_map_vrma to get the real page
numbers that they put into the guest HPT from the Linux page tables
for our userspace as an alternative to getting them from the slot_pfns
arrays.  In future this will enable us to avoid pinning all of guest
memory on POWER7, but we will still have to pin all guest memory on
PPC970 as it doesn't support virtual partition memory.

This also exports find_linux_pte_or_hugepte() since we need it when
KVM is modular.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   31 +++
 arch/powerpc/include/asm/kvm_host.h  |2 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |   26 +-
 arch/powerpc/kvm/book3s_hv.c |1 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |  127 --
 arch/powerpc/mm/hugetlbpage.c|2 +
 6 files changed, 125 insertions(+), 64 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 9243f35..307e649 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -121,4 +121,35 @@ static inline unsigned long *kvmppc_pfn_entry(struct kvm 
*kvm,
 }
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
 
+/*
+ * Lock and read a linux PTE.  If it's present and writable, atomically
+ * set dirty and referenced bits and return the PFN, otherwise return 0.
+ */
+static inline unsigned long kvmppc_read_update_linux_pte(pte_t *p)
+{
+   pte_t pte, tmp;
+   unsigned long pfn = 0;
+
+   /* wait until _PAGE_BUSY is clear then set it atomically */
+   __asm__ __volatile__ (
+   "1: ldarx   %0,0,%3\n"
+   "   andi.   %1,%0,%4\n"
+   "   bne-    1b\n"
+   "   ori     %1,%0,%4\n"
+   "   stdcx.  %1,0,%3\n"
+   "   bne-    1b"
+   : "=&r" (pte), "=&r" (tmp), "=m" (*p)
+   : "r" (p), "i" (_PAGE_BUSY)
+   : "cc");
+
+   if (pte_present(pte) && pte_write(pte)) {
+   pfn = pte_pfn(pte);
+   pte = pte_mkdirty(pte_mkyoung(pte));
+   }
+
+   *p = pte;   /* clears _PAGE_BUSY */
+
+   return pfn;
+}
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 93b7e04..f211643 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -32,6 +32,7 @@
 #include <linux/atomic.h>
 #include <asm/kvm_asm.h>
 #include <asm/processor.h>
+#include <asm/page.h>
 
 #define KVM_MAX_VCPUS  NR_CPUS
 #define KVM_MAX_VCORES NR_CPUS
@@ -432,6 +433,7 @@ struct kvm_vcpu_arch {
struct list_head run_list;
struct task_struct *run_task;
struct kvm_run *kvm_run;
+   pgd_t *pgdir;
 #endif
 };
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 4d558c4..99187db 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -111,13 +111,15 @@ void kvmppc_map_vrma(struct kvm *kvm, struct 
kvm_userspace_memory_region *mem)
unsigned long npages;
unsigned long pfn;
unsigned long *hpte;
-   unsigned long addr, hash;
+   unsigned long addr, hash, hva;
unsigned long psize;
int porder;
struct revmap_entry *rev;
struct kvm_memory_slot *memslot;
unsigned long hp0, hp1;
unsigned long *pfns;
+   pte_t *p;
+   unsigned int shift;
 
	memslot = kvm->memslots->memslots[mem->slot];
	pfns = kvm->arch.slot_pfns[mem->slot];
@@ -138,10 +140,26 @@ void kvmppc_map_vrma(struct kvm *kvm, struct 
kvm_userspace_memory_region *mem)
HPTE_R_R | HPTE_R_C | HPTE_R_M | PP_RWXX;
 
	for (i = 0; i < npages; ++i) {
-   pfn = pfns[i];
-   if (!pfn)
-   continue;
	addr = i << porder;
+   if (pfns) {
+   pfn = pfns[i];
+   } else {
+   pfn = 0;
+   local_irq_disable();
+   hva = addr + mem->userspace_addr;
+   p = find_linux_pte_or_hugepte(current->mm->pgd, hva,
+ &shift);
+   if (p && (psize == PAGE_SIZE || shift == porder))
+   pfn = kvmppc_read_update_linux_pte(p);
+   local_irq_enable();
+   }
+
+   if (!pfn) {
+   pr_err("KVM: Couldn't find page for VRMA at %lx\n",
+  addr);
+   break;
+   }
+
	/* can't use hpt_hash since va > 64 bits */
	hash = (i ^ (VRMA_VSID ^ (VRMA_VSID << 25))) & HPT_HASH_MASK;
/*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7434258..cb21845 100644

[RFC PATCH 09/11] KVM: PPC: Maintain a doubly-linked list of guest HPTEs for each gfn

2011-11-16 Thread Paul Mackerras
This expands the reverse mapping array to contain two links for each
HPTE which are used to link together HPTEs that correspond to the
same guest logical page.  Each circular list of HPTEs is anchored at
the rmap array entry for the guest logical page, in the rmap array of
the relevant memslot.  Links are 32-bit HPT entry indexes rather than
full 64-bit pointers, to save space.  We use 3 of the remaining 32
bits in the rmap array entries as a lock bit, a referenced bit and
a present bit (the present bit is needed since HPTE index 0 is valid).
The bit lock for the rmap chain nests inside the HPTE lock bit.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s.h |2 +
 arch/powerpc/include/asm/kvm_host.h   |   17 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |8 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   88 -
 4 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index ac48438..8454a82 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -143,6 +143,8 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct 
kvmppc_bat *bat,
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu 
*vcpu);
 extern pfn_t kvmppc_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+extern void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
+   unsigned long *rmap, long pte_index, int realmode);
 extern void kvmppc_modify_hpte(struct kvm *kvm, unsigned long *hptep,
unsigned long new_hpte[2], unsigned long pte_index);
 extern void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long addr,
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index ababf17..3dfac3d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -179,12 +179,27 @@ struct kvmppc_slb {
 /*
  * The reverse mapping array has one entry for each HPTE,
  * which stores the guest's view of the second word of the HPTE
- * (including the guest physical address of the mapping).
+ * (including the guest physical address of the mapping),
+ * plus forward and backward pointers in a doubly-linked ring
+ * of HPTEs that map the same host page.  The pointers in this
+ * ring are 32-bit HPTE indexes, to save space.
  */
 struct revmap_entry {
unsigned long guest_rpte;
+   unsigned int forw, back;
 };
 
+/*
+ * We use the top bit of each memslot->rmap entry as a lock bit,
+ * and bit 32 as a present flag.  The bottom 32 bits are the
+ * index in the guest HPT of a HPTE that points to the page.
+ */
+#define KVMPPC_RMAP_LOCK_BIT   63
+#define KVMPPC_RMAP_REF_BIT    33
+#define KVMPPC_RMAP_REFERENCED (1ul << KVMPPC_RMAP_REF_BIT)
+#define KVMPPC_RMAP_PRESENT    0x100000000ul
+#define KVMPPC_RMAP_INDEX      0xfffffffful
+
 struct kvm_arch {
 #ifdef CONFIG_KVM_BOOK3S_64_HV
unsigned long hpt_virt;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 32c7d8c..e93c789 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -179,6 +179,11 @@ void kvmppc_map_vrma(struct kvm *kvm, struct 
kvm_userspace_memory_region *mem)
/* Reverse map info */
	rev = &kvm->arch.revmap[hash];
	rev->guest_rpte = hp1 | addr;
+   if (pfn) {
+   rev->forw = rev->back = hash;
+   memslot->rmap[i << (porder - PAGE_SHIFT)] = hash |
+   KVMPPC_RMAP_REFERENCED | KVMPPC_RMAP_PRESENT;
+   }
}
 }
 
@@ -504,6 +509,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
unsigned long psize, pte_size;
unsigned long gfn, hva, pfn, amr;
struct kvm_memory_slot *memslot;
+   unsigned long *rmap;
struct revmap_entry *rev;
struct page *page, *pages[1];
unsigned int pp, ok;
@@ -605,6 +611,8 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
	hpte[0] = (hpte[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;
	hpte[1] = (rev->guest_rpte & ~(HPTE_R_PP0 - pte_size)) |
		(pfn << PAGE_SHIFT);
+   rmap = &memslot->rmap[gfn - memslot->base_gfn];
+   kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
kvmppc_modify_hpte(kvm, hptep, hpte, index);
if (page)
SetPageDirty(page);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index b477e68..622bfcd 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -57,6 +57,77 @@ static struct kvm_memory_slot *builtin_gfn_to_memslot(struct 
kvm *kvm,
return NULL;
 }
 
+static void lock_rmap(unsigned long *rmap)

[PATCH 05/11] KVM: PPC: Use a separate vmalloc'd array to store pfns

2011-11-16 Thread Paul Mackerras
This changes the book3s_hv code to store the page frame numbers in
a separate vmalloc'd array, pointed to by an array in struct kvm_arch,
rather than the memslot->rmap arrays.  This frees up the rmap arrays
to be used later to store reverse mapping information.  For large page
regions, we now store only one pfn per large page rather than one pfn
per small page.  This reduces the size of the pfns arrays and eliminates
redundant get_page and put_page calls.

We also now pin the guest pages and store the pfns in the commit_memory
function rather than the prepare_memory function.  This avoids a memory
leak should the add memory procedure hit an error after calling the
prepare_memory function.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   15 
 arch/powerpc/include/asm/kvm_host.h  |4 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |   10 ++-
 arch/powerpc/kvm/book3s_hv.c |  124 +++---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |   14 ++--
 5 files changed, 112 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 63542dd..9243f35 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -106,4 +106,19 @@ static inline unsigned long hpte_page_size(unsigned long 
h, unsigned long l)
return 0;   /* error */
 }
 
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+static inline unsigned long *kvmppc_pfn_entry(struct kvm *kvm,
+   struct kvm_memory_slot *memslot, unsigned long gfn)
+{
+   int id = memslot-id;
+   unsigned long index;
+
+   if (!kvm->arch.slot_pfns[id])
+   return NULL;
+   index = gfn - memslot->base_gfn;
+   index >>= kvm->arch.slot_page_order[id] - PAGE_SHIFT;
+   return &kvm->arch.slot_pfns[id][index];
+}
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index e0751e5..93b7e04 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -174,8 +174,6 @@ struct kvm_arch {
 #ifdef CONFIG_KVM_BOOK3S_64_HV
unsigned long hpt_virt;
struct revmap_entry *revmap;
-   unsigned long ram_psize;
-   unsigned long ram_porder;
unsigned int lpid;
unsigned int host_lpid;
unsigned long host_lpcr;
@@ -186,6 +184,8 @@ struct kvm_arch {
unsigned long rmor;
struct kvmppc_rma_info *rma;
struct list_head spapr_tce_tables;
+   unsigned long *slot_pfns[KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS];
+   int slot_page_order[KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS];
unsigned short last_vcpu[NR_CPUS];
struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index bed6c61..4d558c4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -112,13 +112,17 @@ void kvmppc_map_vrma(struct kvm *kvm, struct 
kvm_userspace_memory_region *mem)
unsigned long pfn;
unsigned long *hpte;
unsigned long addr, hash;
-   unsigned long psize = kvm->arch.ram_psize;
-   unsigned long porder = kvm->arch.ram_porder;
+   unsigned long psize;
+   int porder;
struct revmap_entry *rev;
struct kvm_memory_slot *memslot;
unsigned long hp0, hp1;
+   unsigned long *pfns;
 
	memslot = kvm->memslots->memslots[mem->slot];
+   pfns = kvm->arch.slot_pfns[mem->slot];
+   porder = kvm->arch.slot_page_order[mem->slot];
+   psize = 1ul << porder;
	npages = memslot->npages >> (porder - PAGE_SHIFT);
 
	/* VRMA can't be > 1TB */
@@ -134,7 +138,7 @@ void kvmppc_map_vrma(struct kvm *kvm, struct 
kvm_userspace_memory_region *mem)
HPTE_R_R | HPTE_R_C | HPTE_R_M | PP_RWXX;
 
	for (i = 0; i < npages; ++i) {
-   pfn = memslot->rmap[i << (porder - PAGE_SHIFT)];
+   pfn = pfns[i];
	if (!pfn)
		continue;
	addr = i << porder;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 48a0648..7434258 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -133,16 +133,40 @@ static void init_vpa(struct kvm_vcpu *vcpu, struct lppaca 
*vpa)
vpa-yield_count = 1;
 }
 
+unsigned long kvmppc_logical_to_real(struct kvm *kvm, unsigned long gpa,
+unsigned long *nb_ret)
+{
+   struct kvm_memory_slot *memslot;
+   unsigned long gfn, ra, offset;
+   unsigned long *pfnp;
+   unsigned long pg_size;
+
+   gfn = gpa >> PAGE_SHIFT;
+   memslot = gfn_to_memslot(kvm, gfn);
+   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
+   

[PATCH 01/11] KVM: PPC: Add memory-mapping support for PCI passthrough and emulation

2011-11-16 Thread Paul Mackerras
From: Benjamin Herrenschmidt b...@kernel.crashing.org

This adds support for adding PCI device I/O regions to the guest memory
map, and for trapping guest accesses to emulated MMIO regions and
delivering them to qemu for MMIO emulation.  To trap guest accesses to
emulated MMIO regions, we reserve key 31 for the hypervisor's use and
set the VPM1 bit in LPCR, which sends all page faults to the host.
Any page fault that is not a key fault gets reflected immediately to the
guest.  We set HPTEs for emulated MMIO regions to have key = 31, and
don't allow the guest to create HPTEs with key = 31.  Any page fault
that is a key fault with key = 31 is then a candidate for MMIO
emulation and thus gets sent up to qemu.  We also load the instruction
that caused the fault for use later when qemu has done the emulation.

[pau...@samba.org: Cleaned up, moved kvmppc_book3s_hv_emulate_mmio()
 to book3s_64_mmu_hv.c]

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s.h|1 +
 arch/powerpc/include/asm/kvm_book3s_64.h |   24 +++
 arch/powerpc/include/asm/kvm_host.h  |2 +
 arch/powerpc/include/asm/kvm_ppc.h   |1 +
 arch/powerpc/include/asm/reg.h   |4 +
 arch/powerpc/kernel/exceptions-64s.S |8 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  301 +-
 arch/powerpc/kvm/book3s_hv.c |   91 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |  153 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  131 -
 arch/powerpc/kvm/book3s_pr.c |1 +
 arch/powerpc/kvm/booke.c |1 +
 arch/powerpc/kvm/powerpc.c   |2 +-
 include/linux/kvm.h  |3 +
 14 files changed, 656 insertions(+), 67 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index deb8a4e..bd8345f 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -121,6 +121,7 @@ extern void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu 
*vcpu);
 extern int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *pte);
 extern int kvmppc_mmu_map_segment(struct kvm_vcpu *vcpu, ulong eaddr);
 extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
+extern int kvmppc_book3s_hv_emulate_mmio(struct kvm_run *run, struct kvm_vcpu 
*vcpu);
 
 extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache 
*pte);
 extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index d0ac94f..53692c2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -62,4 +62,28 @@ static inline unsigned long compute_tlbie_rb(unsigned long 
v, unsigned long r,
return rb;
 }
 
+/*
+ * We use a lock bit in HPTE dword 0 to synchronize updates and
+ * accesses to each HPTE.
+ */
+#define HPTE_V_HVLOCK  0x40UL
+
+static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
+{
+   unsigned long tmp, old;
+
+   asm volatile("  ldarx   %0,0,%2\n"
+   "   and.    %1,%0,%3\n"
+   "   bne     2f\n"
+   "   ori     %0,%0,%4\n"
+   "   stdcx.  %0,0,%2\n"
+   "   beq+    2f\n"
+   "   li      %1,%3\n"
+   "2: isync"
+   : "=&r" (tmp), "=&r" (old)
+   : "r" (hpte), "r" (bits), "i" (HPTE_V_HVLOCK)
+   : "cc", "memory");
+   return old == 0;
+}
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index bf8af5d..f142a2d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -186,6 +186,8 @@ struct kvm_arch {
struct list_head spapr_tce_tables;
unsigned short last_vcpu[NR_CPUS];
struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
+   unsigned long io_slot_pfn[KVM_MEMORY_SLOTS +
+ KVM_PRIVATE_MEM_SLOTS];
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index a284f20..8c372b9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -132,6 +132,7 @@ extern void kvm_release_rma(struct kvmppc_rma_info *ri);
 extern int kvmppc_core_init_vm(struct kvm *kvm);
 extern void kvmppc_core_destroy_vm(struct kvm *kvm);
 extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
+   struct kvm_memory_slot *memslot,
struct kvm_userspace_memory_region *mem);
 extern void kvmppc_core_commit_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem);
diff