Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap
Adding qemu-devel to Cc.

(2011/11/14 21:39), Avi Kivity wrote:
> On 11/14/2011 12:56 PM, Takuya Yoshikawa wrote:
>> (2011/11/14 19:25), Avi Kivity wrote:
>>> On 11/14/2011 11:20 AM, Takuya Yoshikawa wrote:
>>>> This is a revised version of my previous work.  I hope that the
>>>> patches are more self explanatory than before.
>>>
>>> It looks good.  I'll let Marcelo (or anyone else?) review it as well
>>> before applying.  Do you have performance measurements?
>>
>> For VGA, 30-40us became 3-5us when the display was quiet, with a
>> sufficiently warmed-up guest.
>
> That's a nice improvement.
>
>> Near the criterion, the number was not much different from the
>> original version.  For live migration, I forgot the number but the
>> result was good.  But my test case was not enough to cover every
>> pattern, so I changed the criterion to be a bit conservative.  More
>> tests may be able to find a better criterion.  I am not in a hurry
>> about this, so it is OK to add some tests before merging this.
>
> I think we can merge it as is, it's clear we get an improvement.

I did a simple test to show numbers!

Here, a 4GB guest was being migrated locally while copying a file in
it.  Case 1 corresponds to the original method and case 2 to the
optimized one.

Small numbers are, probably, from VGA:
  Case 1: about 30us
  Case 2: about 3us

Other numbers are from the system RAM (triggered by live migration):
  Case 1: about 500us, 2000us
  Case 2: about 80us, 2000us
  (not exactly averaged, see below for details)

  * 2000us was when rmap was not used, so equal to that of case 1.

So I can say that my patch worked well for both VGA and live migration.

	Takuya

=== measurement snippet ===

Case 1. kvm_mmu_slot_remove_write_access() only (same as the original method):

qemu-system-x86-25413 [000] 6546.215009: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [000] 6546.215010: funcgraph_entry: ! 2039.512 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [000] 6546.217051: funcgraph_exit:  ! 2040.487 us  |  }
qemu-system-x86-25413 [002] 6546.217347: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [002] 6546.217349: funcgraph_entry: !  571.121 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [002] 6546.217921: funcgraph_exit:  !  572.525 us  |  }
qemu-system-x86-25413 [000] 6546.314583: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [000] 6546.314585: funcgraph_entry: +   29.598 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [000] 6546.314616: funcgraph_exit:  +   31.053 us  |  }
qemu-system-x86-25413 [000] 6546.314784: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [000] 6546.314785: funcgraph_entry: ! 2002.591 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [000] 6546.316788: funcgraph_exit:  ! 2003.537 us  |  }
qemu-system-x86-25413 [000] 6546.317082: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [000] 6546.317083: funcgraph_entry: !  624.445 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [000] 6546.317709: funcgraph_exit:  !  625.861 us  |  }
qemu-system-x86-25413 [000] 6546.414261: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [000] 6546.414263: funcgraph_entry: +   29.593 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [000] 6546.414293: funcgraph_exit:  +   30.944 us  |  }
qemu-system-x86-25413 [000] 6546.414528: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [000] 6546.414529: funcgraph_entry: ! 1990.363 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [000] 6546.416520: funcgraph_exit:  ! 1991.370 us  |  }
qemu-system-x86-25413 [000] 6546.416775: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [000] 6546.416776: funcgraph_entry: !  594.333 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [000] 6546.417371: funcgraph_exit:  !  595.415 us  |  }
qemu-system-x86-25413 [000] 6546.514133: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [000] 6546.514135: funcgraph_entry: +   24.032 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [000] 6546.514160: funcgraph_exit:  +   25.074 us  |  }
qemu-system-x86-25413 [000] 6546.514312: funcgraph_entry:                |  write_protect_slot() {
qemu-system-x86-25413 [000] 6546.514313: funcgraph_entry: ! 2035.365 us  |    kvm_mmu_slot_remove_write_access();
qemu-system-x86-25413 [000] 6546.516349: funcgraph_exit:  ! 2036.298 us  |  }
qemu-system-x86-25413 [000] 6546.516642: funcgraph_entry:                |  write_protect_slot() {
Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout
On Wed, 2011-11-16 at 09:21 +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 16, 2011 at 10:28:52AM +1030, Rusty Russell wrote:
>> On Fri, 11 Nov 2011 09:39:13 +0200, Sasha Levin <levinsasha...@gmail.com> wrote:
>>> On Fri, Nov 11, 2011 at 6:24 AM, Rusty Russell <ru...@rustcorp.com.au> wrote:
>>>> (2) There's no huge win in keeping the same layout.  Let's make some
>>>> cleanups.  There are more users ahead of us than behind us (I hope!).
>>>
>>> Actually, if we already do cleanups, here are two more suggestions:
>>>
>>> 1. Make 64bit features one big 64bit block, instead of having 32bits
>>>    in one place and 32 in another.
>>> 2. Remove the reserved fields from the config (the ones that were
>>>    caused by moving the ISR and the notifications out).
>>
>> Yes, those were exactly what I was thinking.  I left it vague because
>> there might be others you can see if we're prepared to abandon the
>> current format.
>>
>> Cheers,
>> Rusty.
>
> Yes, but driver code doesn't get any cleaner by moving the fields.
> And in fact, the legacy support makes the code messier.  What are the
> advantages?

What about splitting the parts which handle legacy code and new code?
It'll make it easier playing with the new spec more freely, and will
also make it easier to remove legacy code in the future, since you'll
simply need to delete a chunk of code instead of removing legacy bits
out of working code with a surgical knife.

-- 
Sasha.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap
On 11/16/2011 06:28 AM, Takuya Yoshikawa wrote:
> (2011/11/14 21:39), Avi Kivity wrote:
>> There was a patchset from Peter Zijlstra that converted mmu notifiers
>> to be preemptible; with that, we can convert the mmu spinlock to a
>> mutex.  I'll see what happened to it.
>
> Interesting!
>
>> There is a third method of doing write protection, and that is by
>> write-protecting at the higher levels of the paging hierarchy.  The
>> advantage there is that write protection is O(1) no matter how large
>> the guest is, or the number of dirty pages.  To write protect all
>> guest memory, we just write protect the 512 PTEs at the very top, and
>> leave the rest alone.  When the guest writes to a page, we allow
>> writes for the top-level PTE that faulted, and write-protect all the
>> PTEs that it points to.
>
> One important point is that the guest, not the GET_DIRTY_LOG caller,
> will pay for the write protection at fault time.

I don't think there is a significant difference.  The number of write
faults does not change.  The amount of work done per fault does, but
not by much, thanks to the writeable bitmap.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout
On Wed, Nov 16, 2011 at 10:17:39AM +0200, Sasha Levin wrote:
> On Wed, 2011-11-16 at 09:21 +0200, Michael S. Tsirkin wrote:
>> On Wed, Nov 16, 2011 at 10:28:52AM +1030, Rusty Russell wrote:
>>> On Fri, 11 Nov 2011 09:39:13 +0200, Sasha Levin <levinsasha...@gmail.com> wrote:
>>>> On Fri, Nov 11, 2011 at 6:24 AM, Rusty Russell <ru...@rustcorp.com.au> wrote:
>>>>> (2) There's no huge win in keeping the same layout.  Let's make some
>>>>> cleanups.  There are more users ahead of us than behind us (I hope!).
>>>>
>>>> Actually, if we already do cleanups, here are two more suggestions:
>>>>
>>>> 1. Make 64bit features one big 64bit block, instead of having 32bits
>>>>    in one place and 32 in another.
>>>> 2. Remove the reserved fields from the config (the ones that were
>>>>    caused by moving the ISR and the notifications out).
>>>
>>> Yes, those were exactly what I was thinking.  I left it vague because
>>> there might be others you can see if we're prepared to abandon the
>>> current format.
>>>
>>> Cheers,
>>> Rusty.
>>
>> Yes, but driver code doesn't get any cleaner by moving the fields.
>> And in fact, the legacy support makes the code messier.  What are the
>> advantages?

The advantages question is what should really balance out the overhead.

> What about splitting the parts which handle legacy code and new code?

Well, I considered that.  Something along the lines of

	#define VIRTIO_NEW_MSI_CONFIG_VECTOR	18

and so on for all registers.  This seems to add a significant
maintenance burden because of code duplication.  Note that, for
example, vector programming is affected.  Multiply that by the number
of guest OSes.

> It'll make it easier playing with the new spec more freely

I'm really worried about maintaining drivers long term.  Ease of
experimentation is secondary for me.

> and will also make it easier to remove legacy code in the future,
> since you'll simply need to delete a chunk of code instead of removing
> legacy bits out of working code with a surgical knife.

It's unlikely to be a single chunk: we'd have structures and macros
which are separate.  So at least 3 chunks.
Just for fun, here's what's involved in removing legacy map support on
top of my patch.  As you see there are 4 chunks: structure decl, map,
unmap, and msix enable/disable.  And finding them was as simple as
looking for legacy_map.

---

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index d242fcc..6c4d2faf 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -64,9 +64,6 @@ struct virtio_pci_device
 	/* Various IO mappings: used for resource tracking only. */
-	/* Legacy BAR0: typically PIO. */
-	void __iomem *legacy_map;
-
 	/* Mappings specified by device capabilities: typically in MMIO */
 	void __iomem *isr_map;
 	void __iomem *notify_map;
@@ -81,11 +78,7 @@ struct virtio_pci_device
 static void virtio_pci_set_msix_enabled(struct virtio_pci_device *vp_dev, int enabled)
 {
 	vp_dev->msix_enabled = enabled;
-	if (vp_dev->device_map)
-		vp_dev->ioaddr_device = vp_dev->device_map;
-	else
-		vp_dev->ioaddr_device = vp_dev->legacy_map +
-			VIRTIO_PCI_CONFIG(vp_dev);
+	vp_dev->ioaddr_device = vp_dev->device_map;
 }

 static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 cap_id,
@@ -147,8 +140,6 @@ err:
 static void virtio_pci_iounmap(struct virtio_pci_device *vp_dev)
 {
-	if (vp_dev->legacy_map)
-		pci_iounmap(vp_dev->pci_dev, vp_dev->legacy_map);
 	if (vp_dev->isr_map)
 		pci_iounmap(vp_dev->pci_dev, vp_dev->isr_map);
 	if (vp_dev->notify_map)
@@ -176,36 +167,15 @@ static int virtio_pci_iomap(struct virtio_pci_device *vp_dev)
 	if (!vp_dev->notify_map || !vp_dev->common_map ||
 	    !vp_dev->device_map) {
-		/*
-		 * If not all capabilities present, map legacy PIO.
-		 * Legacy access is at BAR 0. We never need to map
-		 * more than 256 bytes there, since legacy config space
-		 * used PIO which has this size limit.
-		 */
-		vp_dev->legacy_map = pci_iomap(vp_dev->pci_dev, 0, 256);
-		if (!vp_dev->legacy_map) {
-			dev_err(&vp_dev->vdev.dev, "Unable to map legacy PIO");
-			goto err;
-		}
+		dev_err(&vp_dev->vdev.dev, "Unable to map IO");
+		goto err;
 	}

-	/* Prefer MMIO if available. If not, fallback to legacy PIO. */
-	if (vp_dev->common_map)
-		vp_dev->ioaddr = vp_dev->common_map;
-	else
-		vp_dev->ioaddr = vp_dev->legacy_map;
+	vp_dev->ioaddr = vp_dev->common_map;

-	if (vp_dev->device_map)
-		vp_dev->ioaddr_device = vp_dev->device_map;
-	else
-		vp_dev->ioaddr_device = vp_dev->legacy_map +
-			VIRTIO_PCI_CONFIG(vp_dev);
+
Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
jason wang <jasow...@redhat.com> wrote on 11/16/2011 11:40:45 AM:

Hi Jason,

> Have any thought in mind to solve the issue of flow handling?

So far nothing concrete.  Maybe some performance numbers first is
better; it would let us know where we are.

> During the test of my patchset, I found a big regression in small
> packet transmission, and more retransmissions were noticed.  This may
> also be the issue of flow affinity.  One interesting thing is to see
> whether this happens in your patches :)

I haven't got any results for small packets, but will run this week and
send an update.  I remember my earlier patches having regression for
small packets.

> I've played with a basic flow director implementation based on my
> series which wants to make sure the packets of a flow are handled by
> the same vhost thread/guest vcpu.  This is done by:
>
> - bind virtqueue to guest cpu
> - record the hash to queue mapping when guest sending packets and use
>   this mapping to choose the virtqueue when forwarding packets to guest
>
> Tests show some help for receiving packets from an external host and
> packet sending to the local host.  But it would hurt the performance
> of sending packets to a remote host.  This is not the perfect solution
> as it can not handle the guest moving processes among vcpus; I plan to
> try accelerated RFS and sharing the mapping between host and guest.
> Anyway this is just for receiving; the small packet sending needs more
> thought.

I don't recollect small packet performance for guest-local host.  Also,
using multiple tun devices on the bridge (instead of mq-tun) balances
the rx/tx of a flow to a single vq.  Then you can avoid mq-tun with its
queue selector function, etc.  Have you tried it?

I will run my tests this week and get back.

thanks,
- KK
Re: [PATCH 0/2] Introduce iommu_commit() function
On 06/23/2011 06:38 PM, David Woodhouse wrote:
> On Thu, 2011-06-23 at 17:31 +0200, Joerg Roedel wrote:
>> David, I think especially VT-d can benefit from such a callback.  I
>> will implement support for it in the AMD IOMMU driver and post a
>> patch-set soon.  Any comments, thoughts?
>
> Ick.  We *already* do the flushes as appropriate while we're filling
> the page tables.  So every time we move on from one page table page to
> the next, we'll flush the old one.  And when we've *done* filling the
> page tables for the range we've been asked to map, we flush the last
> writes too.

For the current kvm use case, flushing just once on commit is most
efficient.  If/when we get resumable io faults, per-page flushing
becomes worthwhile.

> The problem with KVM is that it calls us over and over again to map a
> single 4KiB page.  It doesn't seem simple to make use of a 'commit'
> function, because we'd have to keep track of *which* page tables are
> dirty.

You could easily do that by using a free bit in the pte as a dirty bit.
You can then choose whether to use a per-page flush or a full flush.

> I'd much rather KVM just gave us a list of the pages to map, in a
> single call.

The list can easily be several million pages long.

> Or even a 'translation' callback we could call to get the physical
> address for each page in the range.

This is doable, and is probably most flexible.  If the translation also
returns ranges, then you don't have to figure out large mappings
yourself.  Not that there's a huge difference between

	iommu_begin(iommu_transaction, domain)
	for (page in range)
		iommu_map(iommu_transaction, page, translate(page))
	iommu_commit(iommu_transaction)

and

	iommu_map(domain, range, translate)

- one can be converted to the other.

-- 
error compiling committee.c: too many arguments to function
Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
On 11/16/2011 05:09 PM, Krishna Kumar2 wrote:
> jason wang <jasow...@redhat.com> wrote on 11/16/2011 11:40:45 AM:
>
> Hi Jason,
>
>> Have any thought in mind to solve the issue of flow handling?
>
> So far nothing concrete.  Maybe some performance numbers first is
> better; it would let us know where we are.
>
>> During the test of my patchset, I found a big regression in small
>> packet transmission, and more retransmissions were noticed.  This may
>> also be the issue of flow affinity.  One interesting thing is to see
>> whether this happens in your patches :)
>
> I haven't got any results for small packets, but will run this week
> and send an update.  I remember my earlier patches having regression
> for small packets.
>
>> I've played with a basic flow director implementation based on my
>> series which wants to make sure the packets of a flow are handled by
>> the same vhost thread/guest vcpu.  This is done by:
>>
>> - bind virtqueue to guest cpu
>> - record the hash to queue mapping when guest sending packets and use
>>   this mapping to choose the virtqueue when forwarding packets to guest
>>
>> Tests show some help for receiving packets from an external host and
>> packet sending to the local host.  But it would hurt the performance
>> of sending packets to a remote host.  This is not the perfect
>> solution as it can not handle the guest moving processes among vcpus;
>> I plan to try accelerated RFS and sharing the mapping between host
>> and guest.  Anyway this is just for receiving; the small packet
>> sending needs more thought.
>
> I don't recollect small packet performance for guest-local host.
> Also, using multiple tun devices on the bridge (instead of mq-tun)
> balances the rx/tx of a flow to a single vq.  Then you can avoid
> mq-tun with its queue selector function, etc.  Have you tried it?

I remember it worked when I tested your patchset earlier this year, but
I didn't measure its performance.  If multiple tun devices were used,
the mac address table would be updated very frequently and packets
could not be forwarded in parallel (unless we make the bridge support
multiqueue).

> I will run my tests this week and get back.
>
> thanks,
> - KK
[PATCH 1/2] kvm tools: Add optional callbacks for VQs
This patch adds optional callbacks which get called when the VQ gets
assigned an eventfd for notifications, and when it gets assigned a GSI.

This allows the device to pass the eventfds to 3rd parties which can
use them to notify and get notifications regarding the VQ.

Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
---
 tools/kvm/include/kvm/virtio-trans.h |    2 ++
 tools/kvm/virtio/pci.c               |    6 ++++++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/tools/kvm/include/kvm/virtio-trans.h b/tools/kvm/include/kvm/virtio-trans.h
index d9f4b95..e7c186e 100644
--- a/tools/kvm/include/kvm/virtio-trans.h
+++ b/tools/kvm/include/kvm/virtio-trans.h
@@ -20,6 +20,8 @@ struct virtio_ops {
 	int (*notify_vq)(struct kvm *kvm, void *dev, u32 vq);
 	int (*get_pfn_vq)(struct kvm *kvm, void *dev, u32 vq);
 	int (*get_size_vq)(struct kvm *kvm, void *dev, u32 vq);
+	void (*notify_vq_gsi)(struct kvm *kvm, void *dev, u32 vq, u32 gsi);
+	void (*notify_vq_eventfd)(struct kvm *kvm, void *dev, u32 vq, u32 efd);
 };

 struct virtio_trans_ops {
diff --git a/tools/kvm/virtio/pci.c b/tools/kvm/virtio/pci.c
index 1660f06..0737ae7 100644
--- a/tools/kvm/virtio/pci.c
+++ b/tools/kvm/virtio/pci.c
@@ -51,6 +51,9 @@ static int virtio_pci__init_ioeventfd(struct kvm *kvm, struct virtio_trans *vtra
 	ioeventfd__add_event(&ioevent);

+	if (vtrans->virtio_ops->notify_vq_eventfd)
+		vtrans->virtio_ops->notify_vq_eventfd(kvm, vpci->dev, vq, ioevent.fd);
+
 	return 0;
 }

@@ -152,6 +155,9 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_trans *vt
 		gsi = irq__add_msix_route(kvm, &vpci->msix_table[vec].msg);
 		vpci->gsis[vpci->queue_selector] = gsi;
+		if (vtrans->virtio_ops->notify_vq_gsi)
+			vtrans->virtio_ops->notify_vq_gsi(kvm, vpci->dev,
+							  vpci->queue_selector, gsi);
 		break;
 	}
 };
-- 
1.7.8.rc1
[PATCH 2/2] kvm tools: Add vhost-net support
This patch adds support for using the vhost-net device when using a tap
backed virtio-net device.

Activating vhost-net is done by appending a 'vhost=1' flag to the net
device configuration.  For example:

	'kvm run -n mode=tap,vhost=1'

Cc: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
---
 tools/kvm/builtin-run.c            |    2 +
 tools/kvm/include/kvm/virtio-net.h |    1 +
 tools/kvm/virtio/net.c             |  120 +++-
 3 files changed, 122 insertions(+), 1 deletions(-)

diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c
index 13025db..3b00bf0 100644
--- a/tools/kvm/builtin-run.c
+++ b/tools/kvm/builtin-run.c
@@ -217,6 +217,8 @@ static int set_net_param(struct virtio_net_params *p, const char *param,
 		p->guest_ip = strdup(val);
 	} else if (strcmp(param, "host_ip") == 0) {
 		p->host_ip = strdup(val);
+	} else if (strcmp(param, "vhost") == 0) {
+		p->vhost = atoi(val);
 	}

 	return 0;
diff --git a/tools/kvm/include/kvm/virtio-net.h b/tools/kvm/include/kvm/virtio-net.h
index 58ae162..dade8cb 100644
--- a/tools/kvm/include/kvm/virtio-net.h
+++ b/tools/kvm/include/kvm/virtio-net.h
@@ -11,6 +11,7 @@ struct virtio_net_params {
 	char host_mac[6];
 	struct kvm *kvm;
 	int mode;
+	int vhost;
 };

 void virtio_net__init(const struct virtio_net_params *params);
diff --git a/tools/kvm/virtio/net.c b/tools/kvm/virtio/net.c
index cee2b5b..58ca4ed 100644
--- a/tools/kvm/virtio/net.c
+++ b/tools/kvm/virtio/net.c
@@ -10,6 +10,7 @@
 #include "kvm/guest_compat.h"
 #include "kvm/virtio-trans.h"

+#include <linux/vhost.h>
 #include <linux/virtio_net.h>
 #include <linux/if_tun.h>
 #include <linux/types.h>
@@ -25,6 +26,7 @@
 #include <sys/ioctl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
+#include <sys/eventfd.h>

 #define VIRTIO_NET_QUEUE_SIZE		128
 #define VIRTIO_NET_NUM_QUEUES		2
@@ -57,6 +59,7 @@ struct net_dev {
 	pthread_mutex_t	io_tx_lock;
 	pthread_cond_t	io_tx_cond;

+	int		vhost_fd;
 	int		tap_fd;
 	char		tap_name[IFNAMSIZ];
@@ -323,9 +326,12 @@ static void set_guest_features(struct kvm *kvm, void *dev, u32 features)

 static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
 {
+	struct vhost_vring_state state = { .index = vq };
+	struct vhost_vring_addr addr;
 	struct net_dev *ndev = dev;
 	struct virt_queue *queue;
 	void *p;
+	int r;

 	compat__remove_message(compat_id);
@@ -335,9 +341,82 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
 	vring_init(&queue->vring, VIRTIO_NET_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);

+	if (ndev->vhost_fd == 0)
+		return 0;
+
+	state.num = queue->vring.num;
+	r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_NUM, &state);
+	if (r < 0)
+		die_perror("VHOST_SET_VRING_NUM failed");
+	state.num = 0;
+	r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_BASE, &state);
+	if (r < 0)
+		die_perror("VHOST_SET_VRING_BASE failed");
+
+	addr = (struct vhost_vring_addr) {
+		.index = vq,
+		.desc_user_addr = (u64)(unsigned long)queue->vring.desc,
+		.avail_user_addr = (u64)(unsigned long)queue->vring.avail,
+		.used_user_addr = (u64)(unsigned long)queue->vring.used,
+	};
+
+	r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_ADDR, &addr);
+	if (r < 0)
+		die_perror("VHOST_SET_VRING_ADDR failed");
+
 	return 0;
 }

+static void notify_vq_gsi(struct kvm *kvm, void *dev, u32 vq, u32 gsi)
+{
+	struct net_dev *ndev = dev;
+	struct kvm_irqfd irq;
+	struct vhost_vring_file file;
+	int r;
+
+	if (ndev->vhost_fd == 0)
+		return;
+
+	irq = (struct kvm_irqfd) {
+		.gsi	= gsi,
+		.fd	= eventfd(0, 0),
+	};
+	file = (struct vhost_vring_file) {
+		.index	= vq,
+		.fd	= irq.fd,
+	};
+
+	r = ioctl(kvm->vm_fd, KVM_IRQFD, &irq);
+	if (r < 0)
+		die_perror("KVM_IRQFD failed");
+
+	r = ioctl(ndev->vhost_fd, VHOST_SET_VRING_CALL, &file);
+	if (r < 0)
+		die_perror("VHOST_SET_VRING_CALL failed");
+	file.fd = ndev->tap_fd;
+	r = ioctl(ndev->vhost_fd, VHOST_NET_SET_BACKEND, &file);
+	if (r != 0)
+		die("VHOST_NET_SET_BACKEND failed %d", errno);
+
+}
+
+static void notify_vq_eventfd(struct kvm *kvm, void *dev, u32 vq, u32 efd)
+{
+	struct net_dev *ndev = dev;
+	struct vhost_vring_file file = {
+		.index	= vq,
+		.fd	= efd,
+	};
+	int r;
+
+	if (ndev->vhost_fd == 0)
+		return;
+
+	r = ioctl(ndev->vhost_fd,
Re: [PATCH 2/2] kvm tools: Add vhost-net support
On Wed, 2011-11-16 at 14:24 +0200, Sasha Levin wrote:
> This patch adds support for using the vhost-net device when using a
> tap backed virtio-net device.
>
> Activating vhost-net is done by appending a 'vhost=1' flag to the net
> device configuration.  For example:
>
> 	'kvm run -n mode=tap,vhost=1'
>
> Cc: Michael S. Tsirkin <m...@redhat.com>
> Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
> ---

I forgot to attach performance numbers to the changelog, so here they
are:

Short version:
--------------
TCP Throughput: +29%
UDP Throughput: +10%
TCP Latency: -15%
UDP Latency: -12%

Long version:
-------------

Before:

MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    4895.04

MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   65507   10.00      125287      0    6565.60
229376           10.00      106910           5602.57

MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    14811.55

MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

229376 229376 1        1       10.00    16000.44
229376 229376

After:

MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    6340.74

MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   65507   10.00      131478      0    6890.09
229376           10.00      118136           6190.90

MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    17126.10

MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

229376 229376 1        1       10.00    17944.51

-- 
Sasha.
Re: [PATCH 0/2] Introduce iommu_commit() function
On Wed, Nov 16, 2011 at 11:00:56AM +0900, KyongHo Cho wrote:
> On Wed, Jun 29, 2011 at 2:51 PM, Joerg Roedel <j...@8bytes.org> wrote:
>
> In the 'next' branch of
> http://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git,
> I found that iommu_commit() is removed.  Why is it removed?

It was never in the next-branch.  It actually is in the master-branch,
but that happened accidentally :)

The reason is that there is not enough consensus about this interface
yet.  This is also the reason I haven't pushed it upstream yet.

	Joerg
Re: [RFC] kvm tools: Add support for virtio-mmio
On Tue, 2011-11-15 at 17:56 +, Sasha Levin wrote:
> Hmm... If thats the plan, it should probably be a virtio thing (not
> virtio-mmio specific).  Either way, it could also use some
> clarification in the spec.

Well, the spec (p. 2.1) says: "The Subsystem Vendor ID should reflect
the PCI Vendor ID of the environment (it's currently only used for
informational purposes by the guest)."

The fact is that all the current virtio drivers simply ignore this
field.  So unless this changes, I simply have no idea how to describe
that register.  "Put anything there, no one cares"?  "Write zero now,
may change in future"?  Any ideas welcomed.

Cheers!

Paweł

PS. Thanks for defending my honour in the delayed-explosive-device
thread ;-)
Re: [RFC] kvm tools: Add support for virtio-mmio
On Wed, 2011-11-16 at 13:21 +, Pawel Moll wrote:
> On Tue, 2011-11-15 at 17:56 +, Sasha Levin wrote:
>> Hmm... If thats the plan, it should probably be a virtio thing (not
>> virtio-mmio specific).  Either way, it could also use some
>> clarification in the spec.
>
> Well, the spec (p. 2.1) says: "The Subsystem Vendor ID should reflect
> the PCI Vendor ID of the environment (it's currently only used for
> informational purposes by the guest)."
>
> The fact is that all the current virtio drivers simply ignore this
> field.  So unless this changes, I simply have no idea how to describe
> that register.  "Put anything there, no one cares"?  "Write zero now,
> may change in future"?  Any ideas welcomed.
>
> Cheers!
>
> Paweł
>
> PS. Thanks for defending my honour in the delayed-explosive-device
> thread ;-)

We can add an appendix to the virtio spec with known virtio subsystem
vendors, patch QEMU and the KVM tool to pass that, and possibly modify
the QEMU related workarounds in the kernel to only do the workaround
thing if QEMU is set as the vendor.

-- 
Sasha.
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:

Thanks Konrad!  Comments inline.

On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:

On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:

VFIO provides a secure, IOMMU based interface for user space drivers,
including device assignment to virtual machines.  This provides the
base management of IOMMU groups, devices, and IOMMU objects.  See
Documentation/vfio.txt included in this patch for user and kernel API
description.

Note, this implements the new API discussed at KVM Forum 2011, as
represented by the driver version 0.2.  It's hoped that this provides a
modular enough interface to support PCI and non-PCI userspace drivers
across various architectures and IOMMU implementations.

Signed-off-by: Alex Williamson <alex.william...@redhat.com>
---

Fingers crossed, this is the last RFC for VFIO, but we need the iommu
group support before this can go upstream
(http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
hoping this helps push that along.  Since the last posting, this
version completely modularizes the device backends and better defines
the APIs between the core VFIO code and the device backends.  I expect
that we might also adopt a modular IOMMU interface as iommu_ops learns
about different types of hardware.  Also many, many cleanups.  Check
the complete git history for details:

git://github.com/awilliam/linux-vfio.git vfio-ng

(matching qemu tree: git://github.com/awilliam/qemu-vfio.git)

This version, along with the supporting VFIO PCI backend, can be found
here:

git://github.com/awilliam/linux-vfio.git vfio-next-2003

I've held off on implementing a kernel->user signaling mechanism for
now since the previous netlink version produced too many gag reflexes.
It's easy enough to set a bit in the group flags to indicate such
support in the future, so I think we can move ahead without it.

Appreciate any feedback or suggestions.
Thanks, Alex Documentation/ioctl/ioctl-number.txt |1 Documentation/vfio.txt | 304 + MAINTAINERS |8 drivers/Kconfig |2 drivers/Makefile |1 drivers/vfio/Kconfig |8 drivers/vfio/Makefile|3 drivers/vfio/vfio_iommu.c| 530 drivers/vfio/vfio_main.c | 1151 ++ drivers/vfio/vfio_private.h | 34 + include/linux/vfio.h | 155 + 11 files changed, 2197 insertions(+), 0 deletions(-) create mode 100644 Documentation/vfio.txt create mode 100644 drivers/vfio/Kconfig create mode 100644 drivers/vfio/Makefile create mode 100644 drivers/vfio/vfio_iommu.c create mode 100644 drivers/vfio/vfio_main.c create mode 100644 drivers/vfio/vfio_private.h create mode 100644 include/linux/vfio.h diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 54078ed..59d01e4 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -88,6 +88,7 @@ Code Seq#(hex) Include FileComments and kernel/power/user.c '8' all SNP8023 advanced NIC card mailto:m...@solidum.com +';' 64-76 linux/vfio.h '@' 00-0F linux/radeonfb.hconflict! '@' 00-0F drivers/video/aty/aty128fb.cconflict! 'A' 00-1F linux/apm_bios.hconflict! diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt new file mode 100644 index 000..5866896 --- /dev/null +++ b/Documentation/vfio.txt @@ -0,0 +1,304 @@ +VFIO - Virtual Function I/O[1] +--- +Many modern system now provide DMA and interrupt remapping facilities +to help ensure I/O devices behave within the boundaries they've been +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as +well as POWER systems with Partitionable Endpoints (PEs) and even +embedded powerpc systems (technology name unknown). The VFIO driver +is an IOMMU/device agnostic framework for exposing direct device +access to userspace, in a secure, IOMMU protected environment. In +other words, this allows safe, non-privileged, userspace drivers. + +Why do we want that? 
Virtual machines often make use of direct device +access (device assignment) when configured for the highest possible +I/O performance. From a device and host perspective, this simply turns +the VM into a userspace driver, with the benefits of significantly +reduced latency, higher
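The Documentation patch above registers the ';' ioctl magic for the 64-76 range in ioctl-number.txt. As a hedged sketch of how that registration maps to concrete request codes, the macro below (MY_IO is a hypothetical stand-in, not the kernel's actual _IO) mimics the no-direction, no-size case of include/asm-generic/ioctl.h: the magic character lands in bits 15:8 and the sequence number in bits 7:0.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's _IO() macro: with no data
 * direction and no payload size, the encoding reduces to
 * (magic << 8) | sequence-number. */
#define MY_IO(type, nr) ((unsigned int)(((unsigned int)(type) << 8) | (unsigned int)(nr)))

/* The VFIO_DEVICE_RESET ioctl discussed later in the thread would then
 * encode as: */
enum { MY_VFIO_DEVICE_RESET = MY_IO(';', 116) };
```

Since ';' is 0x3b, the resulting request code is 0x3b74, which is how a strace of a VFIO ioctl can be matched back to the registered range.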
[ANNOUNCE] qemu-kvm-1.0-rc2
qemu-kvm-1.0-rc2 is now available. This release is based on the upstream qemu 1.0-rc2, plus kvm-specific enhancements. This release can be used with the kvm kernel modules provided by your distribution kernel, or by the modules in the kvm-kmod package, such as kvm-kmod-3.1. http://www.linux-kvm.org
Re: [RFC PATCH] vfio: VFIO Driver core framework
On 11/11/2011 04:10 PM, Alex Williamson wrote: Thanks Konrad! Comments inline. On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote: On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote: +When supported, as indicated by the device flags, reset the device. + +#define VFIO_DEVICE_RESET _IO(';', 116) Does it disable the 'count'? Err, does it disable the IRQ on the device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS to set new eventfds? Or does it re-use the eventfds and the device is enabled after this? It doesn't affect the interrupt programming. Should it? It should probably clear any currently pending interrupts, as if the unmask IOCTL were called. +device tree properties of the device: + +struct vfio_dtpath { +__u32 len;/* length of structure */ +__u32 index; 0 based I presume? Everything else is, I would assume so/ Yes, it should be zero-based -- this matches how such indices are done in the kernel device tree APIs. +__u64 flags; +#define VFIO_DTPATH_FLAGS_REGION(1 0) What is region in this context?? Or would this make much more sense if I knew what Device Tree actually is. Powerpc guys, any comments? This was their suggestion. These are effectively the first device specific extension, available when VFIO_DEVICE_FLAGS_DT is set. An assigned device may consist of an entire subtree of the device tree, and both register banks and interrupts can come from any node in the tree. Region versus IRQ here indicates the context in which to interpret index, in order to retrieve the path of the node that supplied this particular region or IRQ. +}; +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath) + +struct vfio_dtindex { +__u32 len;/* length of structure */ +__u32 index; +__u32 prop_type; Is that an enum type? Is this definied somewhere? +__u32 prop_index; What is the purpose of this field? 
Need input from powerpc folks here To identify what this resource (register bank or IRQ) this is, we need both the path to the node and the index into the reg or interrupts property within the node. We also need to distinguish reg from ranges, and interrupts from interrupt-map. As you suggested elsewhere in the thread, the device tree API should probably be left out for now, and added later along with the device tree bus driver. +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu, + struct vfio_device *device) +{ + BUG_ON(!iommu-domain device-attached); Whoa. Heavy hammer there. Perhaps WARN_ON as you do check it later on. I think it's warranted, internal consistency is broken if we have a device that thinks it's attached to an iommu domain that doesn't exist. It should, of course, never happen and this isn't a performance path. [snip] +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu, + struct vfio_device *device) +{ + int ret; + + BUG_ON(device-attached); How about: WARN_ON(device-attached, The engineer who wrote the user-space device driver is trying to register the device again! Tell him/her to stop please.\n); I would almost demote this one to a WARN_ON, but userspace isn't in control of attaching and detaching devices from the iommu. That's a side effect of getting the iommu or device file descriptor. So again, this is an internal consistency check and it should never happen, regardless of userspace. The rule isn't to use BUG for internal consistency checks and WARN for stuff userspace can trigger, but rather to use BUG if you cannot reasonably continue, WARN for significant issues that need prompt attention that are reasonably recoverable. Most instances of WARN are internal consistency checks. From include/asm-generic/bug.h: If you're tempted to BUG(), think again: is completely giving up really the *only* solution? There are usually better options, where users don't need to reboot ASAP and can mostly shut down cleanly. 
-Scott
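The BUG vs WARN guidance quoted above can be made concrete with a small userspace sketch. MY_WARN_ON and MY_BUG_ON below are hypothetical analogues of the kernel macros in include/asm-generic/bug.h, not the real implementations: WARN reports the problem and lets the caller recover; BUG gives up completely.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace analogues of the kernel's WARN_ON/BUG_ON. */
#define MY_WARN_ON(cond) \
    ((cond) ? (fprintf(stderr, "WARNING at %s:%d\n", __FILE__, __LINE__), 1) : 0)
#define MY_BUG_ON(cond) \
    do { if (cond) { fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__); abort(); } } while (0)

/* A WARN-style internal consistency check: a double attach is refused
 * gracefully instead of taking the whole process down, which is the
 * "users can mostly shut down cleanly" option bug.h recommends. */
static int attach_device(int already_attached)
{
    if (MY_WARN_ON(already_attached))
        return -1;      /* recoverable error path */
    return 0;           /* normal attach path */
}
```

The design point of the thread is visible here: the WARN version still returns an error the caller can handle, whereas a BUG version of the same check would leave no error path at all.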
[PATCH 0/6] pci-assign: Multiple fixes and cleanups
These patches are all independent. Patches 1 & 2 fix serious usability bugs. Patches 3-6 are more subtle things that Markus was able to find with Coverity. Patch 1 fixes https://bugs.launchpad.net/qemu/+bug/875723 I also tested https://bugs.launchpad.net/qemu/+bug/877155 but I'm unable to reproduce it. An 82576 VF works just fine in a Windows 2008 guest with this patch series. Thanks, Alex --- Alex Williamson (6): pci-assign: Harden I/O port test pci-assign: Remove bogus PCIe lnkcap wmask setting pci-assign: Fix PCIe lnkcap pci-assign: Fix PCI_EXP_FLAGS_TYPE shift pci-assign: Fix I/O port pci-assign: Fix device removal hw/device-assignment.c | 137 1 files changed, 57 insertions(+), 80 deletions(-)
[PATCH 1/6] pci-assign: Fix device removal
We're destroying the memory container before we remove the subregions it holds. This fixes: https://bugs.launchpad.net/qemu/+bug/875723 Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 13 + 1 files changed, 13 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 11efd16..cde0681 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -677,10 +677,23 @@ static void free_assigned_device(AssignedDevice *dev) kvm_remove_ioport_region(region->u.r_baseport, region->r_size, dev->dev.qdev.hotplugged); } +memory_region_del_subregion(&region->container, +&region->real_iomem); +memory_region_destroy(&region->real_iomem); +memory_region_destroy(&region->container); } else if (pci_region->type & IORESOURCE_MEM) { if (region->u.r_virtbase) { memory_region_del_subregion(&region->container, &region->real_iomem); + +/* Remove MSI-X table subregion */ +if (pci_region->base_addr <= dev->msix_table_addr && +pci_region->base_addr + pci_region->size > dev->msix_table_addr) { +memory_region_del_subregion(&region->container, +&dev->mmio); +} + memory_region_destroy(&region->real_iomem); memory_region_destroy(&region->container); if (munmap(region->u.r_virtbase,
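The invariant this patch restores is that a container must be emptied of subregions before it is destroyed. The toy model below is not the QEMU memory API; it is a minimal sketch of that ordering rule, with toy_destroy reporting the bug condition instead of crashing so it can be observed.

```c
#include <assert.h>

/* Toy model of a memory container: counts attached subregions. */
struct toy_region {
    int subregions;     /* how many subregions are still attached */
    int destroyed;
};

static void toy_add_subregion(struct toy_region *c) { c->subregions++; }
static void toy_del_subregion(struct toy_region *c) { c->subregions--; }

/* Destroying a non-empty container is exactly the bug the patch fixes;
 * report it rather than aborting so the failure mode is visible. */
static int toy_destroy(struct toy_region *c)
{
    if (c->subregions != 0)
        return -1;      /* would orphan still-attached subregions */
    c->destroyed = 1;
    return 0;
}
```

With this model, destroy-before-delete fails and delete-then-destroy succeeds, mirroring the reordering the diff performs.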
[PATCH 2/6] pci-assign: Fix I/O port
The old_portio structure seems broken. Throw it away and switch to the new style. This was hitting an assert when trying to make use of I/O port regions. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 103 1 files changed, 35 insertions(+), 68 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index cde0681..571a097 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -65,100 +65,76 @@ static void assigned_dev_load_option_rom(AssignedDevice *dev); static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev); -static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region, - uint32_t addr, int len, uint32_t *val) +static uint64_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region, + target_phys_addr_t addr, int size, + uint64_t *data) { -uint32_t ret = 0; -uint32_t offset = addr; +uint64_t val = 0; int fd = dev_region-region-resource_fd; if (fd = 0) { -if (val) { -DEBUG(pwrite val=%x, len=%d, e_phys=%x, offset=%x\n, - *val, len, addr, offset); -if (pwrite(fd, val, len, offset) != len) { +if (data) { +DEBUG(pwrite data=%x, size=%d, e_phys=%x, addr=%x\n, + *data, size, addr, addr); +if (pwrite(fd, data, size, addr) != size) { fprintf(stderr, %s - pwrite failed %s\n, __func__, strerror(errno)); } } else { -if (pread(fd, ret, len, offset) != len) { +if (pread(fd, val, size, addr) != size) { fprintf(stderr, %s - pread failed %s\n, __func__, strerror(errno)); -ret = (1UL (len * 8)) - 1; +val = (1UL (size * 8)) - 1; } -DEBUG(pread ret=%x, len=%d, e_phys=%x, offset=%x\n, - ret, len, addr, offset); +DEBUG(pread val=%x, size=%d, e_phys=%x, addr=%x\n, + val, size, addr, addr); } } else { -uint32_t port = offset + dev_region-u.r_baseport; +uint32_t port = addr + dev_region-u.r_baseport; -if (val) { -DEBUG(out val=%x, len=%d, e_phys=%x, host=%x\n, - *val, len, addr, port); -switch (len) { +if (data) { +DEBUG(out data=%x, size=%d, e_phys=%x, host=%x\n, + *data, size, addr, port); +switch 
(size) { case 1: -outb(*val, port); +outb(*data, port); break; case 2: -outw(*val, port); +outw(*data, port); break; case 4: -outl(*val, port); +outl(*data, port); break; } } else { -switch (len) { +switch (size) { case 1: -ret = inb(port); +val = inb(port); break; case 2: -ret = inw(port); +val = inw(port); break; case 4: -ret = inl(port); +val = inl(port); break; } -DEBUG(in val=%x, len=%d, e_phys=%x, host=%x\n, - ret, len, addr, port); +DEBUG(in data=%x, size=%d, e_phys=%x, host=%x\n, + val, size, addr, port); } } -return ret; -} - -static void assigned_dev_ioport_writeb(void *opaque, uint32_t addr, - uint32_t value) -{ -assigned_dev_ioport_rw(opaque, addr, 1, value); -return; -} - -static void assigned_dev_ioport_writew(void *opaque, uint32_t addr, - uint32_t value) -{ -assigned_dev_ioport_rw(opaque, addr, 2, value); -return; -} - -static void assigned_dev_ioport_writel(void *opaque, uint32_t addr, - uint32_t value) -{ -assigned_dev_ioport_rw(opaque, addr, 4, value); -return; -} - -static uint32_t assigned_dev_ioport_readb(void *opaque, uint32_t addr) -{ -return assigned_dev_ioport_rw(opaque, addr, 1, NULL); +return val; } -static uint32_t assigned_dev_ioport_readw(void *opaque, uint32_t addr) +static void assigned_dev_ioport_write(void *opaque, target_phys_addr_t addr, + uint64_t data, unsigned size) { -return assigned_dev_ioport_rw(opaque, addr, 2, NULL); +assigned_dev_ioport_rw(opaque, addr, size, data); } -static uint32_t assigned_dev_ioport_readl(void *opaque, uint32_t addr) +static uint64_t
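The structural change in this patch is collapsing six width-specific handlers (readb/readw/readl plus the write variants) into one pair of callbacks that receive an explicit size. The sketch below models only that dispatch pattern; FakePort and fake_ioport_rw are hypothetical stand-ins, since the real code talks to a sysfs resource fd or uses inb/outb and friends.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t reg; } FakePort;   /* stand-in backing store */

/* One function handles all access widths: data != NULL selects the
 * write path, and the size argument builds the width mask once instead
 * of needing a separate function per width. */
static uint64_t fake_ioport_rw(FakePort *p, int size, const uint64_t *data)
{
    uint64_t mask = (size >= 8) ? ~0ULL : ((1ULL << (size * 8)) - 1);
    uint64_t val = 0;

    if (data)                                   /* write path */
        p->reg = (uint32_t)(*data & mask);
    else                                        /* read path */
        val = p->reg & mask;
    return val;
}
```

The new-style MemoryRegionOps callbacks in the patch are thin wrappers over exactly this shape: write passes &data, read passes NULL.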
[PATCH 3/6] pci-assign: Fix PCI_EXP_FLAGS_TYPE shift
Coverity found that we're doing ((uint16_t)type & 0xf0) >> 8. This is obviously always 0x0, so our attempt to filter out some device types thinks everything is an endpoint. Fix shift amount. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 571a097..ec302d2 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1294,7 +1294,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) assigned_dev_setup_cap_read(dev, pos, size); type = pci_get_word(pci_dev->config + pos + PCI_EXP_FLAGS); -type = (type & PCI_EXP_FLAGS_TYPE) >> 8; +type = (type & PCI_EXP_FLAGS_TYPE) >> 4; if (type != PCI_EXP_TYPE_ENDPOINT && type != PCI_EXP_TYPE_LEG_END && type != PCI_EXP_TYPE_RC_END) { fprintf(stderr,
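The bug fits in one expression: PCI_EXP_FLAGS_TYPE covers bits 7:4 of the Express Capabilities register (constant per the PCIe spec and pci_regs.h), so after masking, a shift of 8 always yields 0 and every device classifies as type 0, an endpoint. A shift of 4 recovers the field. A minimal demonstration:

```c
#include <assert.h>
#include <stdint.h>

#define PCI_EXP_FLAGS_TYPE 0x00f0   /* device/port type, bits 7:4 */

/* Extract the type field with a caller-supplied shift so both the buggy
 * (8) and the fixed (4) behavior can be observed side by side. */
static uint16_t exp_type(uint16_t flags, int shift)
{
    return (uint16_t)((flags & PCI_EXP_FLAGS_TYPE) >> shift);
}
```

Since the masked value can never exceed 0xf0, shifting it right by 8 discards every remaining bit, which is exactly what Coverity flagged.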
[PATCH 4/6] pci-assign: Fix PCIe lnkcap
Another Coverity found issue: lnkcap is a 32-bit register and we're masking bits 16 & 17. Fix to uint32_t. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index ec302d2..dd92ce0 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1240,8 +1240,8 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) { uint8_t version, size = 0; -uint16_t type, devctl, lnkcap, lnksta; -uint32_t devcap; +uint16_t type, devctl, lnksta; +uint32_t devcap, lnkcap; version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS); version &= PCI_EXP_FLAGS_VERS; @@ -1326,11 +1326,11 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) pci_set_word(pci_dev->config + pos + PCI_EXP_DEVSTA, 0); /* Link capabilities, expose links and latencies, clear reporting */ -lnkcap = pci_get_word(pci_dev->config + pos + PCI_EXP_LNKCAP); +lnkcap = pci_get_long(pci_dev->config + pos + PCI_EXP_LNKCAP); lnkcap &= (PCI_EXP_LNKCAP_SLS | PCI_EXP_LNKCAP_MLW | PCI_EXP_LNKCAP_ASPMS | PCI_EXP_LNKCAP_L0SEL | PCI_EXP_LNKCAP_L1EL); -pci_set_word(pci_dev->config + pos + PCI_EXP_LNKCAP, lnkcap); +pci_set_long(pci_dev->config + pos + PCI_EXP_LNKCAP, lnkcap); pci_set_word(pci_dev->wmask + pos + PCI_EXP_LNKCAP, PCI_EXP_LNKCTL_ASPMC | PCI_EXP_LNKCTL_RCB | PCI_EXP_LNKCTL_CCC | PCI_EXP_LNKCTL_ES |
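Why the uint16_t mattered: a word-sized accessor truncates bits 31:16 of the 32-bit Link Capabilities register, so any mask touching those bits silently becomes a no-op. The sketch below demonstrates the truncation; fake_get_word and fake_get_long are hypothetical stand-ins for pci_get_word()/pci_get_long(), and the constant matches the kernel's pci_regs.h value.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_EXP_LNKCAP_CLKPM 0x00040000  /* Clock Power Management, bit 18 */

/* A 16-bit read drops everything above bit 15... */
static uint32_t fake_get_word(uint32_t reg) { return reg & 0xffff; }
/* ...while a 32-bit read preserves the whole register. */
static uint32_t fake_get_long(uint32_t reg) { return reg; }
```

With the word accessor, bit 18 is already gone before the mask is applied, so the emulated capability register could never carry the upper-half fields correctly.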
[PATCH 5/6] pci-assign: Remove bogus PCIe lnkcap wmask setting
All the fields of lnkcap are read-only and this is setting it with mask values from LNKCTL. Just below it, we indicate link control is read only, so this appears to be a stray chunk left in from development. Trivial comment fix while we're here. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 6 +- 1 files changed, 1 insertions(+), 5 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index dd92ce0..0160de7 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1312,7 +1312,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) pci_set_long(pci_dev->config + pos + PCI_EXP_DEVCAP, devcap); /* device control: clear all error reporting enable bits, leaving - * leaving only a few host values. Note, these are + * only a few host values. Note, these are * all writable, but not passed to hw. */ devctl = pci_get_word(pci_dev->config + pos + PCI_EXP_DEVCTL); @@ -1331,10 +1331,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) PCI_EXP_LNKCAP_ASPMS | PCI_EXP_LNKCAP_L0SEL | PCI_EXP_LNKCAP_L1EL); pci_set_long(pci_dev->config + pos + PCI_EXP_LNKCAP, lnkcap); -pci_set_word(pci_dev->wmask + pos + PCI_EXP_LNKCAP, - PCI_EXP_LNKCTL_ASPMC | PCI_EXP_LNKCTL_RCB | - PCI_EXP_LNKCTL_CCC | PCI_EXP_LNKCTL_ES | - PCI_EXP_LNKCTL_CLKREQ_EN | PCI_EXP_LNKCTL_HAWD); /* Link control, pass existing read-only copy. Should be writable? */
[PATCH 6/6] pci-assign: Harden I/O port test
Markus Armbruster points out that we're missing a >= 0 check from pread while trying to probe for pci-sysfs io-port resource support. We don't expect a short read, but we should harden the test to abort if we get one so we're not potentially looking at a stale errno. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/device-assignment.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 0160de7..7e6f972 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -434,8 +434,9 @@ static int assigned_dev_register_regions(PCIRegion *io_regions, * kernels return EIO. New kernels only allow 1/2/4 byte reads * so should return EINVAL for a 3 byte read */ ret = pread(pci_dev->v_addrs[i].region->resource_fd, &val, 3, 0); -if (ret == 3) { -fprintf(stderr, "I/O port resource supports 3 byte read?!\n"); +if (ret >= 0) { +fprintf(stderr, "Unexpected return from I/O port read: %d\n", +ret); abort(); } else if (errno != EINVAL) { fprintf(stderr, "Using raw in/out ioport access (sysfs - %s)\n",
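The principle behind the hardening: errno is only meaningful when the syscall actually failed, so any non-negative pread() return, including a short read, must be treated as unexpected rather than interpreted through a stale errno. The helper below is a hypothetical sketch of that decision logic, not code from the patch.

```c
#include <assert.h>
#include <errno.h>

enum probe_result { PROBE_ABORT, PROBE_NEW_KERNEL, PROBE_OLD_KERNEL };

/* Classify the outcome of the deliberately invalid 3-byte probe read.
 * Only a negative return makes errno trustworthy. */
static enum probe_result classify_probe(long pread_ret, int saved_errno)
{
    if (pread_ret >= 0)
        return PROBE_ABORT;          /* 3-byte read "worked": bail out loudly */
    if (saved_errno == EINVAL)
        return PROBE_NEW_KERNEL;     /* 1/2/4-byte-only kernel: use pread/pwrite */
    return PROBE_OLD_KERNEL;         /* e.g. EIO: fall back to raw in/out */
}
```

Note that a short read of 0, 1, or 2 bytes lands in the abort path too; before the patch, those cases fell through to the errno comparison with whatever errno happened to hold.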
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Tue, 2011-11-15 at 16:29 -0600, Scott Wood wrote: On 11/15/2011 03:40 PM, Aaron Fabbri wrote: On 11/15/11 12:10 PM, Scott Wood scottw...@freescale.com wrote: On 11/15/2011 12:34 AM, David Gibson wrote: snip +static int allow_unsafe_intrs; +module_param(allow_unsafe_intrs, int, 0); +MODULE_PARM_DESC(allow_unsafe_intrs, +Allow use of IOMMUs which do not support interrupt remapping); This should not be a global option, but part of the AMD/Intel IOMMU specific code. In general it's a question of how strict the IOMMU driver is about isolation when it determines what the groups are, and only the IOMMU driver can know what the possibilities are for its class of hardware. It's also a concern that is specific to MSIs. In any case, I'm not sure that the ability to cause a spurious IRQ is bad enough to warrant disabling the entire subsystem by default on certain hardware. I think the issue is more that the ability to create fake MSI interrupts can lead to bigger exploits. Originally we didn't have this parameter. It was added it to reflect the fact that MSI's triggered by guests are dangerous without the isolation that interrupt remapping provides. That is, it *should* be inconvenient to run without interrupt mapping HW support. A sysfs knob is sufficient inconvenience. It should only affect whether you can use MSIs, and the relevant issue shouldn't be has interrupt remapping but is there a hole. Some systems might address the issue in ways other than IOMMU-level MSI translation. Our interrupt controller provides enough separate 4K pages for MSI interrupt delivery for each PCIe IOMMU group to get its own (we currently only have 3, one per root complex) -- no special IOMMU feature required. It doesn't help that the semantics of IOMMU_CAP_INTR_REMAP are undefined. I shouldn't have to know how x86 IOMMUs work when implementing a driver for different hardware, just to know what the generic code is expecting. 
As David suggests, if you want to do this it should be the x86 IOMMU driver that has a knob that controls how it forms groups in the absence of this support. That is a possibility, we could push it down to the iommu driver which could simply lump everything into a single groupid when interrupt remapping is not supported. Or more directly, when there is an exposure that devices can trigger random MSIs in the host. Then we wouldn't need an option to override this in vfio, you'd just be stuck not being able to use any devices if you can't bind everything to vfio. That also eliminates the possibility of flipping it on dynamically since we can't handle groupids changing. Then we'd need an iommu=group_unsafe_msi flag to enable it. Ok? Thanks, Alex
[RFC PATCH 11/11] KVM: PPC: Eliminate global spinlock in kvmppc_h_enter
From dfd5bcfac841f8a36593edf60d9fb15e0d633287 Mon Sep 17 00:00:00 2001 From: Paul Mackerras pau...@samba.org Date: Mon, 14 Nov 2011 13:30:38 +1100 Subject: Currently, kvmppc_h_enter takes a spinlock that is global to the guest, kvm->mmu_lock, in order to check for pending PTE invalidations safely. On some workloads, kvmppc_h_enter is called heavily and the use of a global spinlock could compromise scalability. We already use a per-guest-page spinlock in the form of the bit spinlock on the rmap chain, and this gives us synchronization with the PTE invalidation side, which also takes the bit spinlock on the rmap chain for each page being invalidated. Thus it is sufficient to check for pending invalidations while the rmap chain bit spinlock is held. However, now we require barriers in mmu_notifier_retry() and in the places where mmu_notifier_count and mmu_notifier_seq are updated, since we can now call mmu_notifier_retry() concurrently with updates to those fields. Signed-off-by: Paul Mackerras pau...@samba.org --- Cc'd to kvm@vger.kernel.org for review of the generic kvm changes. 
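The mmu_notifier_retry() protocol the commit message relies on can be modeled compactly: an invalidation bumps a count while in progress and a sequence number when done; a page-mapping path snapshots the sequence first and must retry if either field shows an invalidation raced with it. The sketch below uses simplified types and C11 atomics, and omits the explicit barriers the commit message says the real kernel version needs.

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplified stand-in for the relevant fields of struct kvm. */
struct fake_kvm {
    atomic_long mmu_notifier_seq;    /* bumped when an invalidation completes */
    atomic_long mmu_notifier_count;  /* nonzero while one is in flight */
};

/* Returns nonzero if the caller must retry its mapping attempt. */
static int fake_mmu_notifier_retry(struct fake_kvm *kvm, long saved_seq)
{
    if (atomic_load(&kvm->mmu_notifier_count) != 0)
        return 1;   /* an invalidation is in flight right now */
    if (atomic_load(&kvm->mmu_notifier_seq) != saved_seq)
        return 1;   /* an invalidation completed since the snapshot */
    return 0;       /* safe to insert the translation */
}
```

The point of the patch is *where* this check runs: moving it under the per-chain bit spinlock, instead of kvm->mmu_lock, keeps the same race-detection guarantee without a guest-global lock.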
arch/powerpc/include/asm/kvm_book3s_64.h | 13 + arch/powerpc/kvm/book3s_64_mmu_hv.c | 19 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 75 - include/linux/kvm_host.h | 13 +++-- virt/kvm/kvm_main.c |4 ++ 5 files changed, 66 insertions(+), 58 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 3745337..db6cbd5 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -161,4 +161,17 @@ static inline unsigned long kvmppc_read_update_linux_pte(pte_t *p) return pfn; } +static inline void lock_rmap(unsigned long *rmap) +{ + do { + while (test_bit(KVMPPC_RMAP_LOCK_BIT, rmap)) + cpu_relax(); + } while (test_and_set_bit_lock(KVMPPC_RMAP_LOCK_BIT, rmap)); +} + +static inline void unlock_rmap(unsigned long *rmap) +{ + __clear_bit_unlock(KVMPPC_RMAP_LOCK_BIT, rmap); +} + #endif /* __ASM_KVM_BOOK3S_64_H__ */ diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 8c497b8..bb75bfb 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -611,12 +611,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, goto out_put; pfn = page_to_pfn(page); - /* Check if we might have been invalidated; let the guest retry if so */ - ret = RESUME_GUEST; - spin_lock(kvm-mmu_lock); - if (mmu_notifier_retry(vcpu, mmu_seq)) - goto out_unlock; - /* Set the HPTE to point to pfn */ ret = RESUME_GUEST; hptep = (unsigned long *)(kvm-arch.hpt_virt + (index 4)); @@ -627,19 +621,26 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, rev-guest_rpte != hpte[2]) { /* HPTE has been changed under us; let the guest retry */ hptep[0] = ~HPTE_V_HVLOCK; - goto out_unlock; + goto out_put; } hpte[0] = (hpte[0] ~HPTE_V_ABSENT) | HPTE_V_VALID; hpte[1] = (rev-guest_rpte ~(HPTE_R_PP0 - pte_size)) | (pfn PAGE_SHIFT); rmap = memslot-rmap[gfn - memslot-base_gfn]; + lock_rmap(rmap); + + /* Check if we 
might have been invalidated; let the guest retry if so */ + ret = RESUME_GUEST; + if (mmu_notifier_retry(vcpu, mmu_seq)) { + unlock_rmap(rmap); + hptep[0] = ~HPTE_V_HVLOCK; + goto out_put; + } kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0); kvmppc_modify_hpte(kvm, hptep, hpte, index); if (page) SetPageDirty(page); - out_unlock: - spin_unlock(kvm-mmu_lock); out_put: if (page) put_page(page); diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 2cadd06..4070920 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -57,22 +57,16 @@ static struct kvm_memory_slot *builtin_gfn_to_memslot(struct kvm *kvm, return NULL; } -static void lock_rmap(unsigned long *rmap) -{ - do { - while (test_bit(KVMPPC_RMAP_LOCK_BIT, rmap)) - cpu_relax(); - } while (test_and_set_bit_lock(KVMPPC_RMAP_LOCK_BIT, rmap)); -} - -/* Add this HPTE into the chain for the real page */ +/* + * Add this HPTE into the chain for the real page. + * Must be called with the chain locked; it unlocks the chain. + */ void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev, unsigned long *rmap, long pte_index, int realmode) { struct revmap_entry *head, *tail; unsigned long i; -
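The lock_rmap()/unlock_rmap() helpers the patch moves into the header implement a bit spinlock: one bit of the rmap word doubles as the lock, so no per-page lock storage is needed. The userspace sketch below mirrors that shape with C11 atomics standing in for test_and_set_bit_lock()/__clear_bit_unlock(); it uses bit 31 for portability, whereas KVMPPC_RMAP_LOCK_BIT in the kernel is a high bit of a 64-bit word.

```c
#include <assert.h>
#include <stdatomic.h>

#define RMAP_LOCK_MASK (1UL << 31)   /* sketch's lock bit */

static void fake_lock_rmap(atomic_ulong *rmap)
{
    unsigned long old;
    do {
        /* cpu_relax()-style wait until the bit at least looks clear,
         * avoiding a storm of atomic RMW operations while contended */
        while (atomic_load_explicit(rmap, memory_order_relaxed) & RMAP_LOCK_MASK)
            ;
        old = atomic_fetch_or_explicit(rmap, RMAP_LOCK_MASK, memory_order_acquire);
    } while (old & RMAP_LOCK_MASK);  /* lost the race: go around again */
}

static void fake_unlock_rmap(atomic_ulong *rmap)
{
    atomic_fetch_and_explicit(rmap, ~RMAP_LOCK_MASK, memory_order_release);
}
```

Because the lock lives inside the word it protects, the non-lock bits of the rmap entry pass through lock/unlock unchanged, which is what makes it safe to overlay on the existing rmap array.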
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote: On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote: On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote: diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt new file mode 100644 index 000..5866896 --- /dev/null +++ b/Documentation/vfio.txt @@ -0,0 +1,304 @@ +VFIO - Virtual Function I/O[1] +--- +Many modern system now provide DMA and interrupt remapping facilities +to help ensure I/O devices behave within the boundaries they've been +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as +well as POWER systems with Partitionable Endpoints (PEs) and even +embedded powerpc systems (technology name unknown). The VFIO driver +is an IOMMU/device agnostic framework for exposing direct device +access to userspace, in a secure, IOMMU protected environment. In +other words, this allows safe, non-privileged, userspace drivers. It's perhaps worth emphasisng that safe depends on the hardware being sufficiently well behaved. BenH, I know, thinks there are a *lot* of cards that, e.g. have debug registers that allow a backdoor to their own config space via MMIO, which would bypass vfio's filtering of config space access. And that's before we even get into the varying degrees of completeness in the isolation provided by different IOMMUs. Fair enough. I know Tom had emphasized well behaved in the original doc. Virtual functions are probably the best indicator of well behaved. +Why do we want that? Virtual machines often make use of direct device +access (device assignment) when configured for the highest possible +I/O performance. From a device and host perspective, this simply turns +the VM into a userspace driver, with the benefits of significantly +reduced latency, higher bandwidth, and direct use of bare-metal device +drivers[2]. + +Some applications, particularly in the high performance computing +field, also benefit from low-overhead, direct device access from +userspace. 
Examples include network adapters (often non-TCP/IP based) +and compute accelerators. Previous to VFIO, these drivers needed to s/Previous/Prior/ although that may be a .us vs .au usage thing. Same difference, AFAICT. +go through the full development cycle to become proper upstream driver, +be maintained out of tree, or make use of the UIO framework, which +has no notion of IOMMU protection, limited interrupt support, and +requires root privileges to access things like PCI configuration space. + +The VFIO driver framework intends to unify these, replacing both the +KVM PCI specific device assignment currently used as well as provide +a more secure, more featureful userspace driver environment than UIO. + +Groups, Devices, IOMMUs, oh my +--- + +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs +can't always distinguish transactions from each individual device in +the system. Sometimes this is because of the IOMMU design, such as with +PEs, other times it's caused by the I/O topology, for instance a +PCIe-to-PCI bridge masking all devices behind it. We call the sets of +devices created by these restictions IOMMU groups (or just groups for +this document). + +The IOMMU cannot distiguish transactions between the individual devices +within the group, therefore the group is the basic unit of ownership for +a userspace process. Because of this, groups are also the primary +interface to both devices and IOMMU domains in VFIO. + +The VFIO representation of groups is created as devices are added into +the framework by a VFIO bus driver. The vfio-pci module is an example +of a bus driver. This module registers devices along with a set of bus +specific callbacks with the VFIO core. These callbacks provide the +interfaces later used for device access. As each new group is created, +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP +character device. Ok.. 
so, the fact that it's called vfio-pci suggests that the VFIO bus driver is per bus type, not per bus instance. But grouping constraints could be per bus instance, if you have a couple of different models of PCI host bridge with IOMMUs of different capabilities built in, for example. Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus instance. Ok, how can that work. vfio-pci is responsible for generating the groupings, yes? For which it needs to know the iommu/host bridge's isolation capabilities, which vary depending on the type of host bridge. IOMMUs also register drivers per bus type, not per bus instance. The IOMMU driver is free to impose
kvm-tools: can't seem to set guest_mac and KVM_GET_SUPPORTED_CPUID failed.
There was a patch (quoted below) that changed networking at the end of September. When I try to set the guest_mac from the usage in the patch and an admittedly too brief a look at the code, the guest's mac address isn't being set. I'm using: sudo /path/to/linux-kvm/tools/kvm/kvm run -c 1 -m 256 -k /path/to/bzImage-3.0.8 \ -i /path/to/initramfs-host.img --console serial -p ' console=ttyS0 ' -n tap,guest_mac=00:11:11:11:11:11 In the guest I get: # ifconfig eth0 eth0 Link encap:Ethernet HWaddr 02:15:15:15:15:15 inet addr:192.168.122.237 Bcast:192.168.122.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:24 errors:0 dropped:2 overruns:0 frame:0 TX packets:2 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:1874 (1.8 KiB) TX bytes:656 (656.0 B) which is the default. Also, when I start the guest I sometimes get the following error message: # kvm run -k /path/to/bzImage-3.0.8 -m 256 -c 1 --name guest-15757 KVM_GET_SUPPORTED_CPUID failed: Argument list too long I haven't seen that before. Thanks, \dae On Sat, Sep 24, 2011 at 12:17:51PM +0300, Sasha Levin wrote: This patch adds support for multiple network devices. The command line syntax changes to the following: --network/-n [mode=[tap/user/none]] [guest_ip=[guest ip]] [host_ip= [host_ip]] [guest_mac=[guest_mac]] [script=[script]] Each of the parameters is optional, and the config defaults to a TAP based networking with a random MAC. ...
Re: kvm-tools: can't seem to set guest_mac and KVM_GET_SUPPORTED_CPUID failed.
On Wed, 2011-11-16 at 16:42 -0800, David Evensky wrote: There was a patch (quoted below) that changed networking at the end of September. When I try to set the guest_mac, going by the usage in the patch and an admittedly too-brief look at the code, the guest's MAC address isn't being set. I'm using:

sudo /path/to/linux-kvm/tools/kvm/kvm run -c 1 -m 256 -k /path/to/bzImage-3.0.8 \
	-i /path/to/initramfs-host.img --console serial -p ' console=ttyS0 ' \
	-n tap,guest_mac=00:11:11:11:11:11

In the guest I get:

# ifconfig eth0
eth0	Link encap:Ethernet  HWaddr 02:15:15:15:15:15
	inet addr:192.168.122.237  Bcast:192.168.122.255  Mask:255.255.255.0
	UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
	RX packets:24 errors:0 dropped:2 overruns:0 frame:0
	TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
	collisions:0 txqueuelen:1000
	RX bytes:1874 (1.8 KiB)  TX bytes:656 (656.0 B)

which is the default.

This should be '-n mode=tap,guest_mac=00:11:11:11:11:11'. It will set the right MAC:

sh-2.05b# ifconfig
eth0	Link encap:Ethernet  HWaddr 00:11:11:11:11:11
[...]

Also, when I start the guest I sometimes get the following error message:

# kvm run -k /path/to/bzImage-3.0.8 -m 256 -c 1 --name guest-15757
KVM_GET_SUPPORTED_CPUID failed: Argument list too long

Heh, we were talking about this a couple of weeks ago, but since I couldn't reproduce it here (it was happening to me before, but now it's gone) the discussion died. Could you please provide some statistics on how often it happens to you? Also, can you try wrapping the ioctl with a 'while (1)' (there's only one ioctl call to KVM_GET_SUPPORTED_CPUID) and see if it happens at some point? Thanks!

I haven't seen that before. Thanks, \dae

On Sat, Sep 24, 2011 at 12:17:51PM +0300, Sasha Levin wrote: This patch adds support for multiple network devices.
The command line syntax changes to the following: --network/-n [mode=[tap/user/none]] [guest_ip=[guest ip]] [host_ip=[host_ip]] [guest_mac=[guest_mac]] [script=[script]] Each of the parameters is optional, and the config defaults to TAP-based networking with a random MAC. ...

--
Sasha.
Re: kvm-tools: can't seem to set guest_mac and KVM_GET_SUPPORTED_CPUID failed.
On Thu, Nov 17, 2011 at 8:07 AM, Sasha Levin levinsasha...@gmail.com wrote: Also, when I start the guest I sometimes get the following error message: # kvm run -k /path/to/bzImage-3.0.8 -m 256 -c 1 --name guest-15757 KVM_GET_SUPPORTED_CPUID failed: Argument list too long Heh, we were talking about this a couple of weeks ago, but since I couldn't reproduce it here (it was happening to me before, but now it's gone) the discussion died. Could you please provide some statistics on how often it happens to you? Also, can you try wrapping the ioctl with a 'while (1)' (there's only one ioctl call to KVM_GET_SUPPORTED_CPUID) and see if it happens at some point?

I'm no longer able to reproduce it here with 3.2-rc1. We could just take the easy way out and do what QEMU does: retry on E2BIG...
Re: kvm-tools: can't seem to set guest_mac and KVM_GET_SUPPORTED_CPUID failed.
On Thu, 2011-11-17 at 08:53 +0200, Pekka Enberg wrote: On Thu, Nov 17, 2011 at 8:07 AM, Sasha Levin levinsasha...@gmail.com wrote: Also, when I start the guest I sometimes get the following error message: # kvm run -k /path/to/bzImage-3.0.8 -m 256 -c 1 --name guest-15757 KVM_GET_SUPPORTED_CPUID failed: Argument list too long Heh, we were talking about this a couple of weeks ago, but since I couldn't reproduce it here (it was happening to me before, but now it's gone) the discussion died. Could you please provide some statistics on how often it happens to you? Also, can you try wrapping the ioctl with a 'while (1)' (there's only one ioctl call to KVM_GET_SUPPORTED_CPUID) and see if it happens at some point?

I'm no longer able to reproduce it here with 3.2-rc1. We could just take the easy way out and do what QEMU does: retry on E2BIG...

Let's not do that :) It'll just get uncovered again when someone decides to use KVM_GET_SUPPORTED_CPUID somewhere else (like in Avi's cpuid patch). I'll try going back to 3.0 later today and see if it comes back. David, which host kernel do you use?

--
Sasha.
[RFC PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
When a write request targets a cluster that does not have the copied flag set, allocate a new cluster and write the original data, with the modification applied, to the new cluster. This also adds support for writing to qcow2 compressed images.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c        |  322 ++++++++++++++++++++++----------
 tools/kvm/include/kvm/qcow.h |    2 +
 2 files changed, 218 insertions(+), 106 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..2b9af73 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -122,9 +122,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -728,35 +725,110 @@ error_free_rfb:
 	return NULL;
 }
 
-/*
- * QCOW file might grow during a write operation. Not only data but metadata is
- * also written at the end of the file. Therefore it is necessary to ensure
- * every write is committed to disk. Hence we use uses qcow_pwrite_sync() to
- * synchronize the in-core state of QCOW image to disk.
- *
- * We also try to restore the image to a consistent state if the metdata
- * operation fails. The two metadat operations are: level 1 and level 2 table
- * update. If either of them fails the image is truncated to a consistent state.
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16 refcount;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+	if (rfb_idx >= rfb->size) {
+		pr_warning("refcount block index out of bounds");
+		return -1;
+	}
+
+	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
+	rfb->entries[rfb_idx] = cpu_to_be16(refcount);
+	rfb->dirty = 1;
+
+	/* write refcount block */
+	write_refcount_block(q, rfb);
+
+	/* update free_clust_idx since refcount becomes zero */
+	if (!refcount && clust_idx < q->free_clust_idx)
+		q->free_clust_idx = clust_idx;
+
+	return 0;
+}
+
+/* Allocate clusters according to the size. Find a position that
+ * can satisfy the size. free_clust_idx is initialized to zero and
+ * records the last position.
+ */
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+{
+	struct qcow_header *header = q->header;
+	u16 clust_refcount;
+	u32 clust_idx, i;
+	u64 clust_num;
+
+	clust_num = (size + (q->cluster_size - 1)) >> header->cluster_bits;
+
+again:
+	for (i = 0; i < clust_num; i++) {
+		clust_idx = q->free_clust_idx++;
+		clust_refcount = qcow_get_refcount(q, clust_idx);
+		if (clust_refcount < 0)
+			return -1;
+		else if (clust_refcount > 0)
+			goto again;
+	}
+
+	for (i = 0; i < clust_num; i++)
+		update_cluster_refcount(q,
+			q->free_clust_idx - clust_num + i, 1);
+
+	return (q->free_clust_idx - clust_num) << header->cluster_bits;
+}
+
+/* Get the L2 table. If the table has been copied, read the table directly.
+ * If the table exists, allocate a new cluster and copy the table
+ * to the new cluster.
  */
-static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src_len)
+static int get_cluster_table(struct qcow *q, u64 offset,
+		struct qcow_l2_table **result_l2t, u64 *result_l2_index)
 {
 	struct qcow_header *header = q->header;
 	struct qcow_l1_table *l1t = &q->table;
 	struct qcow_l2_table *l2t;
-	u64 clust_start;
-	u64 clust_flags;
-	u64 l2t_offset;
-	u64 clust_off;
-	u64 l2t_size;
-	u64 clust_sz;
 	u64 l1t_idx;
+	u64 l2t_offset;
[RFC PATCH 10/11] KVM: PPC: Implement MMU notifiers
This implements the low-level functions called by the MMU notifiers in the generic KVM code, and defines KVM_ARCH_WANT_MMU_NOTIFIER if CONFIG_KVM_BOOK3S_64_HV is set, so that the generic KVM MMU notifiers get included. That means we also have to take notice of when PTE invalidations are in progress, as indicated by mmu_notifier_retry(). In kvmppc_h_enter, if any invalidation is in progress we just install a non-present HPTE. In kvmppc_book3s_hv_page_fault, if an invalidation is in progress we just return without resolving the fault, causing the guest to encounter another page fault immediately. This is better than spinning inside kvmppc_book3s_hv_page_fault because this way the guest can get preempted by a hypervisor decrementer interrupt without us having to do any special checks. We currently maintain a referenced bit in the rmap array, and when we clear it, we make all the HPTEs that map the corresponding page non-present, as if the page were invalidated. In future we could use the hardware reference bit in the guest HPT instead. The kvm_set_spte_hva function is implemented as kvm_unmap_hva; the former appears to be unused anyway. This all means that on processors that support virtual partition memory (POWER7), we can claim support for the KVM_CAP_SYNC_MMU capability, and we no longer have to pin all the guest memory.
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |   13 +++
 arch/powerpc/kvm/Kconfig            |    1 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  160 ++-
 arch/powerpc/kvm/book3s_hv.c        |   25 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   34 ++-
 arch/powerpc/kvm/powerpc.c          |    3 +
 6 files changed, 218 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 3dfac3d..79bfc69 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -44,6 +44,19 @@
 #define KVM_COALESCED_MMIO_PAGE_OFFSET	1
 #endif
 
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+#include <linux/mmu_notifier.h>
+
+#define KVM_ARCH_WANT_MMU_NOTIFIER
+
+struct kvm;
+extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
+extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
+
+#endif
+
 /* We don't currently support large pages. */
 #define KVM_HPAGE_GFN_SHIFT(x)	0
 #define KVM_NR_PAGE_SIZES	1
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 78133de..8f64709 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -69,6 +69,7 @@ config KVM_BOOK3S_64
 config KVM_BOOK3S_64_HV
 	bool "KVM support for POWER7 and PPC970 using hypervisor mode in host"
 	depends on KVM_BOOK3S_64
+	select MMU_NOTIFIER
 	---help---
 	  Support running unmodified book3s_64 guest kernels in
 	  virtual machines on POWER7 and PPC970 processors that have
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index e93c789..8c497b8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -138,6 +138,15 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 	hp1 = hpte1_pgsize_encoding(psize) |
 		HPTE_R_R | HPTE_R_C | HPTE_R_M | PP_RWXX;
 
+	spin_lock(&kvm->mmu_lock);
+	/* wait until no invalidations are in progress */
+	while (kvm->mmu_notifier_count) {
+		spin_unlock(&kvm->mmu_lock);
+		while (kvm->mmu_notifier_count)
+			cpu_relax();
+		spin_lock(&kvm->mmu_lock);
+	}
+
 	for (i = 0; i < npages; ++i) {
 		addr = i << porder;
 		if (pfns) {
@@ -185,6 +194,7 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 			KVMPPC_RMAP_REFERENCED | KVMPPC_RMAP_PRESENT;
 		}
 	}
+	spin_unlock(&kvm->mmu_lock);
 }
 
 int kvmppc_mmu_hv_init(void)
@@ -506,7 +516,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_slb *slbe;
 	unsigned long *hptep, hpte[3];
-	unsigned long psize, pte_size;
+	unsigned long mmu_seq, psize, pte_size;
 	unsigned long gfn, hva, pfn, amr;
 	struct kvm_memory_slot *memslot;
 	unsigned long *rmap;
@@ -581,6 +591,11 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	if (kvm->arch.slot_pfns[memslot->id])
 		return -EFAULT;		/* should never get here */
 	hva = gfn_to_hva_memslot(memslot, gfn);
+
+	/* used to check for invalidations in progress */
+	mmu_seq = kvm->mmu_notifier_seq;
+	smp_rmb();
+
 	npages = get_user_pages_fast(hva,
[RFC PATCH 07/11] KVM: PPC: Convert do_h_register_vpa to use Linux page tables
This makes do_h_register_vpa use a new helper function, kvmppc_pin_guest_page, to pin the page containing the virtual processor area that the guest wants to register. The logic of whether to use the userspace Linux page tables or the slot_pfns array is thus hidden in kvmppc_pin_guest_page. There is also a new kvmppc_unpin_guest_page to release a previously-pinned page, which we call at VPA unregistration time, or when a new VPA is registered, or when the vcpu is destroyed.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s.h |    3 ++
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   44 +++
 arch/powerpc/kvm/book3s_hv.c          |   52 ++--
 3 files changed, 83 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index bd8345f..b5ee1ce 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -141,6 +141,9 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu);
 extern pfn_t kvmppc_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+extern void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long addr,
+			unsigned long *nb_ret);
+extern void kvmppc_unpin_guest_page(struct kvm *kvm, void *addr);
 extern void kvmppc_entry_trampoline(void);
 extern void kvmppc_hv_entry_trampoline(void);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 99187db..9c7e825 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -480,6 +480,50 @@ int kvmppc_book3s_hv_emulate_mmio(struct kvm_run *run, struct kvm_vcpu *vcpu)
 	return kvmppc_emulate_mmio(run, vcpu);
 }
 
+void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long gpa,
+			unsigned long *nb_ret)
+{
+	struct kvm_memory_slot *memslot;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+	struct page *pages[1];
+	int npages;
+	unsigned long hva, psize, offset;
+	unsigned long pfn;
+	unsigned long *pfnp;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID) ||
+	    (memslot->flags & KVM_MEMSLOT_IO))
+		return NULL;
+	pfnp = kvmppc_pfn_entry(kvm, memslot, gfn);
+	if (pfnp) {
+		pfn = *pfnp;
+		if (!pfn)
+			return NULL;
+		psize = 1ul << kvm->arch.slot_page_order[memslot->id];
+		pages[0] = pfn_to_page(pfn);
+		get_page(pages[0]);
+	} else {
+		hva = gfn_to_hva_memslot(memslot, gfn);
+		npages = get_user_pages_fast(hva, 1, 1, pages);
+		if (npages < 1)
+			return NULL;
+		psize = PAGE_SIZE;
+	}
+	offset = gpa & (psize - 1);
+	if (nb_ret)
+		*nb_ret = psize - offset;
+	return page_address(pages[0]) + offset;
+}
+
+void kvmppc_unpin_guest_page(struct kvm *kvm, void *va)
+{
+	struct page *page = virt_to_page(va);
+
+	page = compound_head(page);
+	put_page(page);
+}
+
 void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_mmu *mmu = &vcpu->arch.mmu;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cb21845..ceb49d2 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -163,10 +163,10 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
				       unsigned long vcpuid, unsigned long vpa)
 {
 	struct kvm *kvm = vcpu->kvm;
-	unsigned long ra, len;
-	unsigned long nb;
+	unsigned long len, nb;
 	void *va;
 	struct kvm_vcpu *tvcpu;
+	int err = H_PARAMETER;
 
 	tvcpu = kvmppc_find_vcpu(kvm, vcpuid);
 	if (!tvcpu)
@@ -179,40 +179,41 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
 	if (flags < 4) {
 		if (vpa & 0x7f)
 			return H_PARAMETER;
+		if (flags >= 2 && !tvcpu->arch.vpa)
+			return H_RESOURCE;
 		/* registering new area; convert logical addr to real */
-		ra = kvmppc_logical_to_real(kvm, vpa, &nb);
-		if (!ra)
+		va = kvmppc_pin_guest_page(kvm, vpa, &nb);
+		if (va == NULL)
 			return H_PARAMETER;
-		va = __va(ra);
 		if (flags <= 1)
 			len = *(unsigned short *)(va + 4);
 		else
 			len = *(unsigned int *)(va + 4);
 		if (len > nb)
-			return H_PARAMETER;
+			goto out_unpin;
 		switch
[PATCH 02/11] KVM: PPC: Keep a record of HV guest view of hashed page table entries
This adds an array that parallels the guest hashed page table (HPT), that is, it has one entry per HPTE, used to store the guest's view of the second doubleword of the corresponding HPTE. The first doubleword in the HPTE is the same as the guest's idea of it, so we don't need to store a copy, but the second doubleword in the HPTE has the real page number rather than the guest's logical page number. This allows us to remove the back_translate() and reverse_xlate() functions. This reverse mapping array is vmalloc'd, meaning that to access it in real mode we have to walk the kernel's page tables explicitly. That is done by the new real_vmalloc_addr() function. (In fact this returns an address in the linear mapping, so the result is usable both in real mode and in virtual mode.) This also corrects a couple of bugs in kvmppc_mmu_get_pp_value().

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   20 +
 arch/powerpc/include/asm/kvm_host.h      |   10 ++
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  136 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |   95 +
 4 files changed, 147 insertions(+), 114 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 53692c2..63542dd 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -29,6 +29,14 @@ static inline struct kvmppc_book3s_shadow_vcpu *to_svcpu(struct kvm_vcpu *vcpu)
 
 #define SPAPR_TCE_SHIFT		12
 
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/* For now use fixed-size 16MB page table */
+#define HPT_ORDER	24
+#define HPT_NPTEG	(1ul << (HPT_ORDER - 7))	/* 128B per pteg */
+#define HPT_NPTE	(HPT_NPTEG << 3)		/* 8 PTEs per PTEG */
+#define HPT_HASH_MASK	(HPT_NPTEG - 1)
+#endif
+
 static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
					     unsigned long pte_index)
 {
@@ -86,4 +94,16 @@ static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
 	return old == 0;
 }
 
+static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
+{
+	/* only handle 4k, 64k and 16M pages for now */
+	if (!(h & HPTE_V_LARGE))
+		return 1ul << 12;		/* 4k page */
+	if ((l & 0xf000) == 0x1000 && cpu_has_feature(CPU_FTR_ARCH_206))
+		return 1ul << 16;		/* 64k page */
+	if ((l & 0xff000) == 0)
+		return 1ul << 24;		/* 16M page */
+	return 0;				/* error */
+}
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index f142a2d..56f7046 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -166,9 +166,19 @@ struct kvmppc_rma_info {
 	atomic_t use_count;
 };
 
+/*
+ * The reverse mapping array has one entry for each HPTE,
+ * which stores the guest's view of the second word of the HPTE
+ * (including the guest physical address of the mapping).
+ */
+struct revmap_entry {
+	unsigned long guest_rpte;
+};
+
 struct kvm_arch {
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	unsigned long hpt_virt;
+	struct revmap_entry *revmap;
 	unsigned long ram_npages;
 	unsigned long ram_psize;
 	unsigned long ram_porder;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index da8c2f4..2b9b8be 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -23,6 +23,7 @@
 #include <linux/gfp.h>
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
+#include <linux/vmalloc.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -33,11 +34,6 @@
 #include <asm/ppc-opcode.h>
 #include <asm/cputable.h>
 
-/* For now use fixed-size 16MB page table */
-#define HPT_ORDER	24
-#define HPT_NPTEG	(1ul << (HPT_ORDER - 7))	/* 128B per pteg */
-#define HPT_HASH_MASK	(HPT_NPTEG - 1)
-
 /* Pages in the VRMA are 16MB pages */
 #define VRMA_PAGE_ORDER	24
 #define VRMA_VSID	0x1ffUL	/* 1TB VSID reserved for VRMA */
@@ -51,7 +47,9 @@ long kvmppc_alloc_hpt(struct kvm *kvm)
 {
 	unsigned long hpt;
 	unsigned long lpid;
+	struct revmap_entry *rev;
 
+	/* Allocate guest's hashed page table */
 	hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|__GFP_NOWARN,
			       HPT_ORDER - PAGE_SHIFT);
 	if (!hpt) {
@@ -60,12 +58,20 @@ long kvmppc_alloc_hpt(struct kvm *kvm)
 	}
 	kvm->arch.hpt_virt = hpt;
 
+	/* Allocate reverse map array */
+	rev = vmalloc(sizeof(struct revmap_entry) * HPT_NPTE);
+	if (!rev) {
+		pr_err("kvmppc_alloc_hpt: Couldn't alloc reverse map array\n");
+		goto out_freehpt;
+	}
+
[RFC PATCH 08/11] KVM: PPC: Add a page fault handler function
This adds a kvmppc_book3s_hv_page_fault function that is capable of handling the fault we get if the guest tries to access a non-present page (one that we have marked with storage key 31 and no-execute), and either doing MMIO emulation, or making the page resident and rewriting the guest HPTE to point to it, if it is RAM. We now call this for hypervisor instruction storage interrupts, and for hypervisor data storage interrupts instead of the emulate-MMIO function. It can now be called for real-mode accesses through the VRMA as well as virtual-mode accesses. In order to identify non-present HPTEs, we use a second software-use bit in the first dword of the HPTE, called HPTE_V_ABSENT. We can't just look for storage key 31 because non-present HPTEs for the VRMA have to be actually invalid, as the storage key mechanism doesn't operate in real mode. Using this bit also means that we don't have to restrict the guest from using key 31 any more.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s.h    |    6 +-
 arch/powerpc/include/asm/kvm_book3s_64.h |   11 ++-
 arch/powerpc/include/asm/kvm_host.h      |   30 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  259 +++---
 arch/powerpc/kvm/book3s_hv.c             |   54 --
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  121 --
 6 files changed, 340 insertions(+), 141 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index b5ee1ce..ac48438 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -121,7 +121,9 @@ extern void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu);
 extern int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *pte);
 extern int kvmppc_mmu_map_segment(struct kvm_vcpu *vcpu, ulong eaddr);
 extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
-extern int kvmppc_book3s_hv_emulate_mmio(struct kvm_run *run, struct kvm_vcpu *vcpu);
+extern int kvmppc_book3s_hv_page_fault(struct kvm_run *run,
+			struct kvm_vcpu *vcpu, unsigned long addr,
+			unsigned long status);
 extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache *pte);
 extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu);
@@ -141,6 +143,8 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu);
 extern pfn_t kvmppc_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+extern void kvmppc_modify_hpte(struct kvm *kvm, unsigned long *hptep,
+			unsigned long new_hpte[2], unsigned long pte_index);
 extern void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long addr,
			unsigned long *nb_ret);
 extern void kvmppc_unpin_guest_page(struct kvm *kvm, void *addr);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 307e649..3745337 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -37,6 +37,8 @@ static inline struct kvmppc_book3s_shadow_vcpu *to_svcpu(struct kvm_vcpu *vcpu)
 #define HPT_HASH_MASK	(HPT_NPTEG - 1)
 #endif
 
+#define VRMA_VSID	0x1ffUL	/* 1TB VSID reserved for VRMA */
+
 static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
					     unsigned long pte_index)
 {
@@ -72,9 +74,11 @@ static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 /*
  * We use a lock bit in HPTE dword 0 to synchronize updates and
- * accesses to each HPTE.
+ * accesses to each HPTE, and another bit to indicate non-present
+ * HPTEs.
  */
 #define HPTE_V_HVLOCK	0x40UL
+#define HPTE_V_ABSENT	0x20UL
 
 static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
 {
@@ -106,6 +110,11 @@ static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
 	return 0;				/* error */
 }
 
+static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
+{
+	return ((ptel & HPTE_R_RPN) & ~(psize - 1)) >> PAGE_SHIFT;
+}
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 static inline unsigned long *kvmppc_pfn_entry(struct kvm *kvm,
			struct kvm_memory_slot *memslot, unsigned long gfn)
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index f211643..ababf17 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -162,6 +162,20 @@ struct kvmppc_rma_info {
 	atomic_t use_count;
 };
 
+struct kvmppc_slb {
+	u64 esid;
+	u64 vsid;
+	u64 orige;
+	u64 origv;
+	bool valid	: 1;
+	bool Ks		: 1;
+	bool Kp		: 1;
+
[PATCH 03/11] KVM: PPC: Allow use of small pages to back guest memory
From: Nishanth Aravamudan n...@us.ibm.com

This puts the page frame numbers for the memory backing the guest in the slot->rmap array for each slot, rather than using the ram_pginfo array. Since the rmap array is vmalloc'd, we use real_vmalloc_addr() to access it when we access it in real mode in kvmppc_h_enter(). The rmap array contains one PFN for each small page, even if the backing memory is large pages. This lets us get rid of the ram_pginfo array.

[pau...@samba.org - Cleaned up and reorganized a bit, abstracted out HPTE page size encoding functions, added check that memory being added in kvmppc_core_prepare_memory_region is all in one VMA.]

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |    8 --
 arch/powerpc/kvm/book3s_64_mmu_hv.c |   47 +++
 arch/powerpc/kvm/book3s_hv.c        |  153 +--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   90 ++--
 4 files changed, 151 insertions(+), 147 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 56f7046..52fd741 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -145,11 +145,6 @@ struct kvmppc_exit_timing {
 	};
 };
 
-struct kvmppc_pginfo {
-	unsigned long pfn;
-	atomic_t refcnt;
-};
-
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;
@@ -179,17 +174,14 @@ struct kvm_arch {
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	unsigned long hpt_virt;
 	struct revmap_entry *revmap;
-	unsigned long ram_npages;
 	unsigned long ram_psize;
 	unsigned long ram_porder;
-	struct kvmppc_pginfo *ram_pginfo;
 	unsigned int lpid;
 	unsigned int host_lpid;
 	unsigned long host_lpcr;
 	unsigned long sdr1;
 	unsigned long host_sdr1;
 	int tlbie_lock;
-	int n_rma_pages;
 	unsigned long lpcr;
 	unsigned long rmor;
 	struct kvmppc_rma_info *rma;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2b9b8be..bed6c61 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -34,8 +34,6 @@
 #include <asm/ppc-opcode.h>
 #include <asm/cputable.h>
 
-/* Pages in the VRMA are 16MB pages */
-#define VRMA_PAGE_ORDER	24
 #define VRMA_VSID	0x1ffUL	/* 1TB VSID reserved for VRMA */
 
 /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
@@ -95,19 +93,33 @@ void kvmppc_free_hpt(struct kvm *kvm)
 	free_pages(kvm->arch.hpt_virt, HPT_ORDER - PAGE_SHIFT);
 }
 
+/* Bits in first HPTE dword for pagesize 4k, 64k or 16M */
+static inline unsigned long hpte0_pgsize_encoding(unsigned long pgsize)
+{
+	return (pgsize > 0x1000) ? HPTE_V_LARGE : 0;
+}
+
+/* Bits in second HPTE dword for pagesize 4k, 64k or 16M */
+static inline unsigned long hpte1_pgsize_encoding(unsigned long pgsize)
+{
+	return (pgsize == 0x10000) ? 0x1000 : 0;
+}
+
 void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 {
 	unsigned long i;
-	unsigned long npages = kvm->arch.ram_npages;
+	unsigned long npages;
 	unsigned long pfn;
 	unsigned long *hpte;
-	unsigned long hash;
+	unsigned long addr, hash;
+	unsigned long psize = kvm->arch.ram_psize;
 	unsigned long porder = kvm->arch.ram_porder;
 	struct revmap_entry *rev;
-	struct kvmppc_pginfo *pginfo = kvm->arch.ram_pginfo;
+	struct kvm_memory_slot *memslot;
+	unsigned long hp0, hp1;
 
-	if (!pginfo)
-		return;
+	memslot = &kvm->memslots->memslots[mem->slot];
+	npages = memslot->npages >> (porder - PAGE_SHIFT);
 
 	/* VRMA can't be > 1TB */
 	if (npages > 1ul << (40 - porder))
@@ -116,10 +128,16 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 	if (npages > HPT_NPTEG)
 		npages = HPT_NPTEG;
 
+	hp0 = HPTE_V_1TB_SEG | (VRMA_VSID << (40 - 16)) |
+		HPTE_V_BOLTED | hpte0_pgsize_encoding(psize) | HPTE_V_VALID;
+	hp1 = hpte1_pgsize_encoding(psize) |
+		HPTE_R_R | HPTE_R_C | HPTE_R_M | PP_RWXX;
+
 	for (i = 0; i < npages; ++i) {
-		pfn = pginfo[i].pfn;
+		pfn = memslot->rmap[i << (porder - PAGE_SHIFT)];
 		if (!pfn)
-			break;
+			continue;
+		addr = i << porder;
 		/* can't use hpt_hash since va > 64 bits */
 		hash = (i ^ (VRMA_VSID ^ (VRMA_VSID << 25))) & HPT_HASH_MASK;
 		/*
@@ -131,17 +149,14 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 		hash = (hash << 3) + 7;
 		hpte = (unsigned long *) (kvm->arch.hpt_virt + (hash << 4));
 		/* HPTE low word - RPN, protection, etc. */
-		hpte[1] = (pfn << PAGE_SHIFT) | HPTE_R_R | HPTE_R_C |
-
[RFC PATCH 06/11] KVM: PPC: Use Linux page tables in h_enter and map_vrma
This changes kvmppc_h_enter() and kvmppc_map_vrma() to get the real page numbers that they put into the guest HPT from the Linux page tables for our userspace, as an alternative to getting them from the slot_pfns arrays. In future this will enable us to avoid pinning all of guest memory on POWER7, but we will still have to pin all guest memory on PPC970 as it doesn't support virtual partition memory.

This also exports find_linux_pte_or_hugepte() since we need it when KVM is modular.

Signed-off-by: Paul Mackerras <pau...@samba.org>
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   31 +++
 arch/powerpc/include/asm/kvm_host.h      |    2 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |   26 +-
 arch/powerpc/kvm/book3s_hv.c             |    1 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  127 --
 arch/powerpc/mm/hugetlbpage.c            |    2 +
 6 files changed, 125 insertions(+), 64 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 9243f35..307e649 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -121,4 +121,35 @@ static inline unsigned long *kvmppc_pfn_entry(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_BOOK3S_64_HV */

+/*
+ * Lock and read a linux PTE.  If it's present and writable, atomically
+ * set dirty and referenced bits and return the PFN, otherwise return 0.
+ */
+static inline unsigned long kvmppc_read_update_linux_pte(pte_t *p)
+{
+	pte_t pte, tmp;
+	unsigned long pfn = 0;
+
+	/* wait until _PAGE_BUSY is clear then set it atomically */
+	__asm__ __volatile__ (
+		"1:	ldarx	%0,0,%3\n"
+		"	andi.	%1,%0,%4\n"
+		"	bne-	1b\n"
+		"	ori	%1,%0,%4\n"
+		"	stdcx.	%1,0,%3\n"
+		"	bne-	1b"
+		: "=&r" (pte), "=&r" (tmp), "=m" (*p)
+		: "r" (p), "i" (_PAGE_BUSY)
+		: "cc");
+
+	if (pte_present(pte) && pte_write(pte)) {
+		pfn = pte_pfn(pte);
+		pte = pte_mkdirty(pte_mkyoung(pte));
+	}
+
+	*p = pte;	/* clears _PAGE_BUSY */
+
+	return pfn;
+}
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 93b7e04..f211643 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -32,6 +32,7 @@
 #include <linux/atomic.h>
 #include <asm/kvm_asm.h>
 #include <asm/processor.h>
+#include <asm/page.h>

 #define KVM_MAX_VCPUS		NR_CPUS
 #define KVM_MAX_VCORES		NR_CPUS
@@ -432,6 +433,7 @@ struct kvm_vcpu_arch {
 	struct list_head run_list;
 	struct task_struct *run_task;
 	struct kvm_run *kvm_run;
+	pgd_t *pgdir;
 #endif
 };

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 4d558c4..99187db 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -111,13 +111,15 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 	unsigned long npages;
 	unsigned long pfn;
 	unsigned long *hpte;
-	unsigned long addr, hash;
+	unsigned long addr, hash, hva;
 	unsigned long psize;
 	int porder;
 	struct revmap_entry *rev;
 	struct kvm_memory_slot *memslot;
 	unsigned long hp0, hp1;
 	unsigned long *pfns;
+	pte_t *p;
+	unsigned int shift;

 	memslot = &kvm->memslots->memslots[mem->slot];
 	pfns = kvm->arch.slot_pfns[mem->slot];
@@ -138,10 +140,26 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 		HPTE_R_R | HPTE_R_C | HPTE_R_M | PP_RWXX;

 	for (i = 0; i < npages; ++i) {
-		pfn = pfns[i];
-		if (!pfn)
-			continue;
 		addr = i << porder;
+		if (pfns) {
+			pfn = pfns[i];
+		} else {
+			pfn = 0;
+			local_irq_disable();
+			hva = addr + mem->userspace_addr;
+			p = find_linux_pte_or_hugepte(current->mm->pgd, hva,
+						      &shift);
+			if (p && (psize == PAGE_SIZE || shift == porder))
+				pfn = kvmppc_read_update_linux_pte(p);
+			local_irq_enable();
+		}
+
+		if (!pfn) {
+			pr_err("KVM: Couldn't find page for VRMA at %lx\n",
+			       addr);
+			break;
+		}
+
 		/* can't use hpt_hash since va > 64 bits */
 		hash = (i ^ (VRMA_VSID ^ (VRMA_VSID << 25))) & HPT_HASH_MASK;
 		/*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7434258..cb21845 100644
[RFC PATCH 09/11] KVM: PPC: Maintain a doubly-linked list of guest HPTEs for each gfn
This expands the reverse mapping array to contain two links for each HPTE, which are used to link together HPTEs that correspond to the same guest logical page. Each circular list of HPTEs is pointed to by the rmap array entry for the guest logical page, pointed to by the relevant memslot. Links are 32-bit HPT entry indexes rather than full 64-bit pointers, to save space. We use 3 of the remaining 32 bits in the rmap array entries as a lock bit, a referenced bit and a present bit (the present bit is needed since HPTE index 0 is valid). The bit lock for the rmap chain nests inside the HPTE lock bit.

Signed-off-by: Paul Mackerras <pau...@samba.org>
---
 arch/powerpc/include/asm/kvm_book3s.h |    2 +
 arch/powerpc/include/asm/kvm_host.h   |   17 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |    8 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   88 -
 4 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index ac48438..8454a82 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -143,6 +143,8 @@ extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
 extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu);
 extern pfn_t kvmppc_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+extern void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
+			unsigned long *rmap, long pte_index, int realmode);
 extern void kvmppc_modify_hpte(struct kvm *kvm, unsigned long *hptep,
 			unsigned long new_hpte[2], unsigned long pte_index);
 extern void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long addr,
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ababf17..3dfac3d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -179,12 +179,27 @@ struct kvmppc_slb {

 /*
  * The reverse mapping array has one entry for each HPTE,
  * which stores the guest's view of the second word of the HPTE
- * (including the guest physical address of the mapping).
+ * (including the guest physical address of the mapping),
+ * plus forward and backward pointers in a doubly-linked ring
+ * of HPTEs that map the same host page.  The pointers in this
+ * ring are 32-bit HPTE indexes, to save space.
  */
 struct revmap_entry {
 	unsigned long guest_rpte;
+	unsigned int forw, back;
 };

+/*
+ * We use the top bit of each memslot->rmap entry as a lock bit,
+ * and bit 32 as a present flag.  The bottom 32 bits are the
+ * index in the guest HPT of a HPTE that points to the page.
+ */
+#define KVMPPC_RMAP_LOCK_BIT	63
+#define KVMPPC_RMAP_REF_BIT	33
+#define KVMPPC_RMAP_REFERENCED	(1ul << KVMPPC_RMAP_REF_BIT)
+#define KVMPPC_RMAP_PRESENT	0x100000000ul
+#define KVMPPC_RMAP_INDEX	0xfffffffful
+
 struct kvm_arch {
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	unsigned long hpt_virt;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 32c7d8c..e93c789 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -179,6 +179,11 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 		/* Reverse map info */
 		rev = &kvm->arch.revmap[hash];
 		rev->guest_rpte = hp1 | addr;
+		if (pfn) {
+			rev->forw = rev->back = hash;
+			memslot->rmap[i << (porder - PAGE_SHIFT)] = hash |
+				KVMPPC_RMAP_REFERENCED | KVMPPC_RMAP_PRESENT;
+		}
 	}
 }

@@ -504,6 +509,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	unsigned long psize, pte_size;
 	unsigned long gfn, hva, pfn, amr;
 	struct kvm_memory_slot *memslot;
+	unsigned long *rmap;
 	struct revmap_entry *rev;
 	struct page *page, *pages[1];
 	unsigned int pp, ok;
@@ -605,6 +611,8 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	hpte[0] = (hpte[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;
 	hpte[1] = (rev->guest_rpte & ~(HPTE_R_PP0 - pte_size)) |
 		(pfn << PAGE_SHIFT);
+	rmap = &memslot->rmap[gfn - memslot->base_gfn];
+	kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
 	kvmppc_modify_hpte(kvm, hptep, hpte, index);
 	if (page)
 		SetPageDirty(page);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index b477e68..622bfcd 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -57,6 +57,77 @@ static struct kvm_memory_slot *builtin_gfn_to_memslot(struct kvm *kvm,
 	return NULL;
 }

+static void lock_rmap(unsigned long *rmap)
[PATCH 05/11] KVM: PPC: Use a separate vmalloc'd array to store pfns
This changes the book3s_hv code to store the page frame numbers in a separate vmalloc'd array, pointed to by an array in struct kvm_arch, rather than in the memslot->rmap arrays. This frees up the rmap arrays to be used later to store reverse mapping information.

For large page regions, we now store only one pfn per large page rather than one pfn per small page. This reduces the size of the pfns arrays and eliminates redundant get_page and put_page calls.

We also now pin the guest pages and store the pfns in the commit_memory function rather than the prepare_memory function. This avoids a memory leak should the add memory procedure hit an error after calling the prepare_memory function.

Signed-off-by: Paul Mackerras <pau...@samba.org>
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   15 ++++
 arch/powerpc/include/asm/kvm_host.h      |    4 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |   10 ++-
 arch/powerpc/kvm/book3s_hv.c             |  124 +++---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |   14 ++--
 5 files changed, 112 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 63542dd..9243f35 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -106,4 +106,19 @@ static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
 	return 0;				/* error */
 }

+#ifdef CONFIG_KVM_BOOK3S_64_HV
+static inline unsigned long *kvmppc_pfn_entry(struct kvm *kvm,
+			struct kvm_memory_slot *memslot, unsigned long gfn)
+{
+	int id = memslot->id;
+	unsigned long index;
+
+	if (!kvm->arch.slot_pfns[id])
+		return NULL;
+	index = gfn - memslot->base_gfn;
+	index >>= kvm->arch.slot_page_order[id] - PAGE_SHIFT;
+	return &kvm->arch.slot_pfns[id][index];
+}
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e0751e5..93b7e04 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -174,8 +174,6 @@ struct kvm_arch {
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	unsigned long hpt_virt;
 	struct revmap_entry *revmap;
-	unsigned long ram_psize;
-	unsigned long ram_porder;
 	unsigned int lpid;
 	unsigned int host_lpid;
 	unsigned long host_lpcr;
@@ -186,6 +184,8 @@ struct kvm_arch {
 	unsigned long rmor;
 	struct kvmppc_rma_info *rma;
 	struct list_head spapr_tce_tables;
+	unsigned long *slot_pfns[KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS];
+	int slot_page_order[KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS];
 	unsigned short last_vcpu[NR_CPUS];
 	struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index bed6c61..4d558c4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -112,13 +112,17 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 	unsigned long pfn;
 	unsigned long *hpte;
 	unsigned long addr, hash;
-	unsigned long psize = kvm->arch.ram_psize;
-	unsigned long porder = kvm->arch.ram_porder;
+	unsigned long psize;
+	int porder;
 	struct revmap_entry *rev;
 	struct kvm_memory_slot *memslot;
 	unsigned long hp0, hp1;
+	unsigned long *pfns;

 	memslot = &kvm->memslots->memslots[mem->slot];
+	pfns = kvm->arch.slot_pfns[mem->slot];
+	porder = kvm->arch.slot_page_order[mem->slot];
+	psize = 1ul << porder;
 	npages = memslot->npages >> (porder - PAGE_SHIFT);

 	/* VRMA can't be > 1TB */
@@ -134,7 +138,7 @@ void kvmppc_map_vrma(struct kvm *kvm, struct kvm_userspace_memory_region *mem)
 		HPTE_R_R | HPTE_R_C | HPTE_R_M | PP_RWXX;

 	for (i = 0; i < npages; ++i) {
-		pfn = memslot->rmap[i << (porder - PAGE_SHIFT)];
+		pfn = pfns[i];
 		if (!pfn)
 			continue;
 		addr = i << porder;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 48a0648..7434258 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -133,16 +133,40 @@ static void init_vpa(struct kvm_vcpu *vcpu, struct lppaca *vpa)
 	vpa->yield_count = 1;
 }

+unsigned long kvmppc_logical_to_real(struct kvm *kvm, unsigned long gpa,
+				     unsigned long *nb_ret)
+{
+	struct kvm_memory_slot *memslot;
+	unsigned long gfn, ra, offset;
+	unsigned long *pfnp;
+	unsigned long pg_size;
+
+	gfn = gpa >> PAGE_SHIFT;
+	memslot = gfn_to_memslot(kvm, gfn);
+	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
+
[PATCH 01/11] KVM: PPC: Add memory-mapping support for PCI passthrough and emulation
From: Benjamin Herrenschmidt <b...@kernel.crashing.org>

This adds support for adding PCI device I/O regions to the guest memory map, and for trapping guest accesses to emulated MMIO regions and delivering them to qemu for MMIO emulation.

To trap guest accesses to emulated MMIO regions, we reserve key 31 for the hypervisor's use and set the VPM1 bit in LPCR, which sends all page faults to the host. Any page fault that is not a key fault gets reflected immediately to the guest. We set HPTEs for emulated MMIO regions to have key = 31, and don't allow the guest to create HPTEs with key = 31. Any page fault that is a key fault with key = 31 is then a candidate for MMIO emulation and thus gets sent up to qemu. We also load the instruction that caused the fault for use later when qemu has done the emulation.

[pau...@samba.org: Cleaned up, moved kvmppc_book3s_hv_emulate_mmio() to book3s_64_mmu_hv.c]

Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Signed-off-by: Paul Mackerras <pau...@samba.org>
---
 arch/powerpc/include/asm/kvm_book3s.h    |    1 +
 arch/powerpc/include/asm/kvm_book3s_64.h |   24 +++
 arch/powerpc/include/asm/kvm_host.h      |    2 +
 arch/powerpc/include/asm/kvm_ppc.h       |    1 +
 arch/powerpc/include/asm/reg.h           |    4 +
 arch/powerpc/kernel/exceptions-64s.S     |    8 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  301 +-
 arch/powerpc/kvm/book3s_hv.c             |   91 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  153 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  131 -
 arch/powerpc/kvm/book3s_pr.c             |    1 +
 arch/powerpc/kvm/booke.c                 |    1 +
 arch/powerpc/kvm/powerpc.c               |    2 +-
 include/linux/kvm.h                      |    3 +
 14 files changed, 656 insertions(+), 67 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index deb8a4e..bd8345f 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -121,6 +121,7 @@ extern void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu);
 extern int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *pte);
 extern int kvmppc_mmu_map_segment(struct kvm_vcpu *vcpu, ulong eaddr);
 extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
+extern int kvmppc_book3s_hv_emulate_mmio(struct kvm_run *run, struct kvm_vcpu *vcpu);
 extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache *pte);
 extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index d0ac94f..53692c2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -62,4 +62,28 @@ static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 	return rb;
 }

+/*
+ * We use a lock bit in HPTE dword 0 to synchronize updates and
+ * accesses to each HPTE.
+ */
+#define HPTE_V_HVLOCK	0x40UL
+
+static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
+{
+	unsigned long tmp, old;
+
+	asm volatile("	ldarx	%0,0,%2\n"
+		     "	and.	%1,%0,%3\n"
+		     "	bne	2f\n"
+		     "	ori	%0,%0,%4\n"
+		     "	stdcx.	%0,0,%2\n"
+		     "	beq+	2f\n"
+		     "	li	%1,%3\n"
+		     "2:	isync"
+		     : "=&r" (tmp), "=&r" (old)
+		     : "r" (hpte), "r" (bits), "i" (HPTE_V_HVLOCK)
+		     : "cc", "memory");
+	return old == 0;
+}
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index bf8af5d..f142a2d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -186,6 +186,8 @@ struct kvm_arch {
 	struct list_head spapr_tce_tables;
 	unsigned short last_vcpu[NR_CPUS];
 	struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
+	unsigned long io_slot_pfn[KVM_MEMORY_SLOTS +
+				  KVM_PRIVATE_MEM_SLOTS];
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
 };

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a284f20..8c372b9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -132,6 +132,7 @@ extern void kvm_release_rma(struct kvmppc_rma_info *ri);
 extern int kvmppc_core_init_vm(struct kvm *kvm);
 extern void kvmppc_core_destroy_vm(struct kvm *kvm);
 extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
+				struct kvm_memory_slot *memslot,
 				struct kvm_userspace_memory_region *mem);
 extern void kvmppc_core_commit_memory_region(struct kvm *kvm,
 				struct kvm_userspace_memory_region *mem);
diff