Re: [Qemu-devel] 9p broken?
Avi Kivity writes: > Having an annoying bug on i386 kvm I decided to debug it buy running an > i386 guest on my x86_64 host, use 9p to access a guest image, and run it > using nested kvm. > > However, 9p appears to be broken: first, the configure test fails (patch > sent). Second, while mount works, ls on the mount point causes qemu to > crash with I missed that you have already sent a patch for configure fix. That looks better that what i sent. I will ack that patch > > (gdb) bt > #0 error_set (errp=0x7fffe95fb128, fmt=0x558d4568 "{ 'class': > 'VirtFSFeatureBlocksMigration', 'data': { 'path': %s, 'tag': %s } }") at > /home/tlv/akivity/qemu/error.c:32 > #1 0x5567cb06 in v9fs_attach (opaque=0x7fffe95e3020) at > /home/tlv/akivity/qemu/hw/9pfs/virtio-9p.c:988 > #2 0x5561d19f in coroutine_trampoline (i0=1449767888, i1=21845) > at /home/tlv/akivity/qemu/coroutine-ucontext.c:138 > #3 0x75a93ef0 in ?? () from /lib64/libc.so.6 > #4 0x7fffce00 in ?? () > #5 0x in ?? ( > > **errp already points to a VirtFSFeatureBlocksMigration error; > v9fs_attach() has been called a second time (the first time, > understandably, on mount; the second on ls). > Why are we calling attach a second time ?. I am also not able to reproduce this root@qemu-img-64:~# mount -t 9p -otrans=virtio,version=9p2000.L v_tmp /mnt root@qemu-img-64:~# ls /mnt/a.c /mnt/a.c > Command line: > > qemu-system-x86_64 -m 4G -smp 4 -drive > file=/images/Fedora-i386.img,if=virtio,cache=none -cdrom > /images/iso/bfo.iso -device virtio-9p-pci,fsdev=root,mount_tag=root > -fsdev local,id=root,path=/,security_model=passthrough -enable-kvm -net > nic,model=virtio,netdev=net -netdev user,id=net -monitor stdio -cpu host > -aneesh -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] 9p broken?
"Richard W.M. Jones" writes: > On Mon, Jul 30, 2012 at 03:35:39PM +0300, Avi Kivity wrote: >> Having an annoying bug on i386 kvm I decided to debug it buy running an >> i386 guest on my x86_64 host, use 9p to access a guest image, and run it >> using nested kvm. >> >> However, 9p appears to be broken: first, the configure test fails (patch >> sent). Second, while mount works, ls on the mount point causes qemu to >> crash with >> >> (gdb) bt >> #0 error_set (errp=0x7fffe95fb128, fmt=0x558d4568 "{ 'class': >> 'VirtFSFeatureBlocksMigration', 'data': { 'path': %s, 'tag': %s } }") at >> /home/tlv/akivity/qemu/error.c:32 >> #1 0x5567cb06 in v9fs_attach (opaque=0x7fffe95e3020) at >> /home/tlv/akivity/qemu/hw/9pfs/virtio-9p.c:988 >> #2 0x5561d19f in coroutine_trampoline (i0=1449767888, i1=21845) >> at /home/tlv/akivity/qemu/coroutine-ucontext.c:138 >> #3 0x75a93ef0 in ?? () from /lib64/libc.so.6 >> #4 0x7fffce00 in ?? () >> #5 0x in ?? ( >> >> **errp already points to a VirtFSFeatureBlocksMigration error; >> v9fs_attach() has been called a second time (the first time, >> understandably, on mount; the second on ls). > > Yes, I can reproduce this too. > > LIBGUESTFS_QEMU=~/d/qemu/qemu.wrapper \ > guestfish -v -- \ > sparse /tmp/unused 100M : \ > config -device 'virtio-9p-pci,fsdev=root,mount_tag=root' : \ > config -fsdev 'local,id=root,path=/tmp,security_model=passthrough' : \ > run : \ > mount-9p root / : \ > ls / > > Stack trace: > > #0 0x7fb1d4d19ba5 in __GI_raise (sig=sig@entry=6) > at ../nptl/sysdeps/unix/sysv/linux/raise.c:63 > #1 0x7fb1d4d1b358 in __GI_abort () at abort.c:90 > #2 0x7fb1d4d12972 in __assert_fail_base (fmt= > 0x7fb1d4e5c8e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", > assertion=assertion@entry=0x7fb1d8e2e87e "*errp == ((void *)0)", > file=file@entry=0x7fb1d8e56217 "error.c", line=line@entry=35, > function=function@entry= > 0x7fb1d8e2e8ca <__PRETTY_FUNCTION__.13983> "error_set") at assert.c:92 > #3 0x7fb1d4d12a22 in __GI___assert_fail (assertion=assertion@entry= > 0x7fb1d8e2e87e "*errp == ((void *)0)", file=file@entry= > 0x7fb1d8e56217 "error.c", line=line@entry=35, function=function@entry= > 0x7fb1d8e2e8ca <__PRETTY_FUNCTION__.13983> "error_set") at assert.c:101 > #4 0x7fb1d8c1147f in error_set (errp=errp@entry=0x7fb1ce36a128, > fmt=fmt@entry= > 0x7fb1d8e39a78 "{ 'class': 'VirtFSFeatureBlocksMigration', 'data': { > 'path': %s, 'tag': %s } }") at error.c:35 > #5 0x7fb1d8c57f9b in v9fs_attach (opaque=0x7fb1ce352020) > at /home/rjones/d/qemu/hw/9pfs/virtio-9p.c:988 > #6 0x7fb1d8c0fcfa in coroutine_trampoline (i0=, > i1=) at coroutine-ucontext.c:138 > #7 0x7fb1d4d2a2f0 in ?? () from /lib64/libc.so.6 > #8 0x7fff51061aa0 in ?? () > #9 0xb5b5b5b5b5b5b5b5 in ?? () > #10 0x in ?? () > > I'll add a regression test for 9p to libguestfs so at least we will > catch this in future during Fedora builds. > I am not able to reproduce this, I had to patch configure to fix some compile errors, but then I am not hitting the crash. I am using the latest qemu. We do the below in 9p error_set(&s->migration_blocker, QERR_VIRTFS_FEATURE_BLOCKS_MIGRATION, s->ctx.fs_root ? s->ctx.fs_root : "NULL", s->tag); I am not sure how we can hit that assert() in error_set() ?. We allocate s via s = (V9fsState *)virtio_common_init() which does vdev = g_malloc0(struct_size); -aneesh -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: don't bother tracking is_dir
On Sun, 22 Jul 2012, Sasha Levin wrote: > This is something we can calculate on the fly, and doesn't justify the > overhead > of tracking it all over fid transitions. > > Signed-off-by: Sasha Levin What kind of overhead are you talking about? > --- > tools/kvm/include/kvm/virtio-9p.h |1 - > tools/kvm/virtio/9p.c | 16 +++- > 2 files changed, 11 insertions(+), 6 deletions(-) > > diff --git a/tools/kvm/include/kvm/virtio-9p.h > b/tools/kvm/include/kvm/virtio-9p.h > index 186fe05..cb590d1 100644 > --- a/tools/kvm/include/kvm/virtio-9p.h > +++ b/tools/kvm/include/kvm/virtio-9p.h > @@ -26,7 +26,6 @@ struct p9_msg { > struct p9_fid { > u32 fid; > u32 uid; > - u8 is_dir; > charabs_path[PATH_MAX]; > char*path; > DIR *dir; > diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c > index c663dab..2cf99a0 100644 > --- a/tools/kvm/virtio/9p.c > +++ b/tools/kvm/virtio/9p.c > @@ -228,6 +228,15 @@ static int virtio_p9_openflags(int flags) > return flags; > } > > +static bool is_dir(struct p9_fid *fid) > +{ > + struct stat st; > + > + stat(fid->abs_path, &st); > + > + return S_ISDIR(st.st_mode); > +} > + > static void virtio_p9_open(struct p9_dev *p9dev, > struct p9_pdu *pdu, u32 *outlen) > { > @@ -245,7 +254,7 @@ static void virtio_p9_open(struct p9_dev *p9dev, > > stat2qid(&st, &qid); > > - if (new_fid->is_dir) { > + if (is_dir(new_fid)) { > new_fid->dir = opendir(new_fid->abs_path); > if (!new_fid->dir) > goto err_out; > @@ -394,7 +403,6 @@ static void virtio_p9_walk(struct p9_dev *p9dev, > goto err_out; > > stat2qid(&st, &wqid); > - new_fid->is_dir = S_ISDIR(st.st_mode); > strcpy(new_fid->path, tmp); > new_fid->uid = fid->uid; > nwqid++; > @@ -406,7 +414,6 @@ static void virtio_p9_walk(struct p9_dev *p9dev, >*/ > pdu->write_offset += sizeof(u16); > old_fid = get_fid(p9dev, fid_val); > - new_fid->is_dir = old_fid->is_dir; > strcpy(new_fid->path, old_fid->path); > new_fid->uid= old_fid->uid; > } > @@ -445,7 +452,6 @@ static void virtio_p9_attach(struct p9_dev *p9dev, > > fid = get_fid(p9dev, fid_val); > fid->uid = uid; > - fid->is_dir = 1; > strcpy(fid->path, "/"); > > virtio_p9_pdu_writef(pdu, "Q", &qid); > @@ -547,7 +553,7 @@ static void virtio_p9_readdir(struct p9_dev *p9dev, > virtio_p9_pdu_readf(pdu, "dqd", &fid_val, &offset, &count); > fid = get_fid(p9dev, fid_val); > > - if (!fid->is_dir) { > + if (!is_dir(fid)) { > errno = EINVAL; > goto err_out; > } > -- > 1.7.8.6 > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH] Use kernel supplied MMU info for kvm tool
On Wed, 18 Jul 2012, Michael Ellerman wrote: > It occurred to me overnight that I forgot to mention that in order to > build the new code you need the headers from a 3.5-rc1 era kernel (for > the ioctl & KVM_CAP definitions). > > The easiest way to do that is to merge linus' tree into kvmtool. > > Are you planning on doing that in the master kvmtool tree anytime soon? > It's still based on 3.4-rc1 it seems. Done. Sorry for the delay! -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL (PATCH 0/4)] VFIO driver for v3.6
On Mon, Jul 30, 2012 at 4:17 PM, Alex Williamson wrote: > > I'm pretty anxious to find out as well. Linus, ping, any thoughts on > including this in 3.6? Thanks, I just pulled it, but then I unpulled again when I realized it's not a signed tag and it's on github. Please, people. Do tagged releases with proper signatures if you're not using kernel.org or other controlled servers. In fact, I prefer signed tags even if you *do* use kernel.org etc. Linus -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM: MMU: Tracking guest writes through EPT entries ?
On 07/31/2012 01:18 AM, Sunil wrote: > Hello List, > > I am a KVM newbie and studying KVM mmu code. > > On the existing guest, I am trying to track all guest writes by > marking page table entry as read-only in EPT entry [ I am using Intel > machine with vmx and ept support ]. Looks like EPT support re-uses > shadow page table(SPT) code and hence some of SPT routines. > > I was thinking of below possible approach. Use pte_list_walk() to > traverse through list of sptes and use mmu_spte_update() to flip the > PT_WRITABLE_MASK flag. But all SPTEs are not part of any single list; > but on separate lists (based on gfn, page level, memory_slot). So, > recording all the faulted guest GFN and then using above method work ? > There are two ways to write-protect all sptes: - use kvm_mmu_slot_remove_write_access() on all memslots - walk the shadow page cache to get the shadow pages in the highest level (level = 4 on EPT), then write-protect its entries. If you just want to do it for the specified gfn, you can use rmap_write_protect(). Just inquisitive, what is your purpose? :) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] tcm_vhost: Initial merge of vhost level target fabric driver
Hi Linus, Here is the PULL request for the initial merge of tcm_vhost based on RFC-v5 code with MST's ACK appended to the initial merge commit. As promised, the commit is available from two different branches for you to consider merging as for-3.6 code. The 'for-next-merge' branch based on mainline commit 7409a6657ae using 3.5-rc2 code contains two duplicates of pre-merge vhost patch dependencies that have already been merged into mainline via net-next. This commit is also in the 07302012 -next patchset, and available here: git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git for-next-merge Or the 'for-linus' branch containing an -rc0 head @ commit bdc0077af57: Merge tag 'scsi-misc' of git://git.kernel.org/../jejb/scsi) rebased up to the last commit in scsi-misc required for virtio-scsi client LLD scanning logic to function properly with tcm_vhost fabric ports, is available here: git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git for-linus Both branches have gotten recent testing and have been running over-night small block random I/O tests connected to raw block flash backends. The same diffstat below will result from pulling either branch. Also, the incremental patch to address MST's last round of post-merge comments has been sent to the lists for feedback this afternoon. This will be included into the usual post -rc1 PULL via 3.6-rc-fixes, along with any other bits that end up changing post-merge. Please let us know if you have any concerns. Thank you! --nab Nicholas Bellinger (1): tcm_vhost: Initial merge for vhost level target fabric driver drivers/vhost/Kconfig |3 + drivers/vhost/Kconfig.tcm |6 + drivers/vhost/Makefile|2 + drivers/vhost/tcm_vhost.c | 1628 + drivers/vhost/tcm_vhost.h | 101 +++ 5 files changed, 1740 insertions(+), 0 deletions(-) create mode 100644 drivers/vhost/Kconfig.tcm create mode 100644 drivers/vhost/tcm_vhost.c create mode 100644 drivers/vhost/tcm_vhost.h -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
On Tue, 2012-07-31 at 03:36 +0300, Michael S. Tsirkin wrote: > On Mon, Jul 30, 2012 at 06:26:31PM -0600, Alex Williamson wrote: > > On Tue, 2012-07-31 at 03:01 +0300, Michael S. Tsirkin wrote: > > > On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote: > > > > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote: > > > > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote: > > > > > > This new ioctl enables an eventfd to be triggered when an EOI is > > > > > > written for a specified irqchip pin. The first user of this will > > > > > > be external device assignment through VFIO, using a level irqfd > > > > > > for asserting a PCI INTx interrupt and this interface for de-assert > > > > > > and notification once the interrupt is serviced. > > > > > > > > > > > > Here we make use of the reference counting of the _irq_source > > > > > > object allowing us to share it with an irqfd and cleanup regardless > > > > > > of the release order. > > > > > > > > > > > > Signed-off-by: Alex Williamson > > > > > > > > > > > --- > > > > > > > > > > > > Documentation/virtual/kvm/api.txt | 21 ++ > > > > > > arch/x86/kvm/x86.c|2 > > > > > > include/linux/kvm.h | 15 ++ > > > > > > include/linux/kvm_host.h | 13 + > > > > > > virt/kvm/eventfd.c| 336 > > > > > > + > > > > > > virt/kvm/kvm_main.c | 11 + > > > > > > 6 files changed, 398 insertions(+) > > > > > > > > > > > > diff --git a/Documentation/virtual/kvm/api.txt > > > > > > b/Documentation/virtual/kvm/api.txt > > > > > > index 3911e62..8cd6b36 100644 > > > > > > --- a/Documentation/virtual/kvm/api.txt > > > > > > +++ b/Documentation/virtual/kvm/api.txt > > > > > > @@ -1989,6 +1989,27 @@ return the hash table order in the > > > > > > parameter. (If the guest is using > > > > > > the virtualized real-mode area (VRMA) facility, the kernel will > > > > > > re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.) > > > > > > > > > > > > +4.77 KVM_EOIFD > > > > > > + > > > > > > +Capability: KVM_CAP_EOIFD > > > > > > +Architectures: x86 > > > > > > +Type: vm ioctl > > > > > > +Parameters: struct kvm_eoifd (in) > > > > > > +Returns: 0 on success, < 0 on error > > > > > > + > > > > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification > > > > > > +through an eventfd. > > > > > > > > > > I thought about it some more, and I think it should be renamed to an > > > > > interrupt ack notification than eoi notification. > > > > > For example, consider userspace that uses threaded > > > > > interrupts.interrupts. > > > > > Currently what will happen is each interrupt will be injected > > > > > twice, since on eoi device is still asserting it. > > > > > > > > I don't follow, why is userspace writing an eoi to the ioapic if it > > > > hasn't handled the interrupt > > > > > > It has handled it - it disabled the hardware interrupt. > > > > So it's not injected twice, it's held pending at the ioapic the second > > time, just like hardware. > > It is not like hardware at all. in hardware there is no overhead > here you interrupot the guest to run handler in host. Obviously we have some overhead, we're emulating the guest hardware. That doesn't make the behavior unlike hardware. > > Maybe there's a future optimization there, > > but I don't think it's appropriate at this time. > > Yes. But to make it *possible* in future we must remove > the requirement to signal fd immediately on EOI. > So rename it ackfd. How does the name make that possible? We can easily add a flag EOIFD_FLAG_EOI_ON_REENABLE, or whatever. > > > > and why wouldn't the same happen on bare > > > > metal? > > > > > > on bare metal level does not matter as long as interrupt > > > is disabled. > > > > > > > > One fix would be to delay event until interrupt is re-enabled. > > > > > Now I am not asking you to fix this immediately, > > > > > but I think we should make the interface generic by > > > > > saying we report an ack to userspace and not specifically EOI. > > > > > > > > Using the word "delay" in the context of interrupt delivery raises all > > > > sorts of red flags for me, but I really don't understand your argument. > > > > > > I am saying it's an "ack" of interrupt userspace cares about. > > > The fact it is done by EOI is an implementation detail. > > > > The implementation is how an EOI is generated on an ioapic, not that an > > EOI exists. How do I read a hardware spec and figure out what "ack of > > interrupt" means? > > It just means it will be called after guest has completed handling > interrupt. How we detect that is our problem. Conceptually, we're still looking for the EOI, we may just be able to optimize to EOI && irqchip pin unmasked. The name doesn't prohibit anything here and eoifd is more descriptive that it relates to the end of an interrupt where ackfd means nothing (what got acked?). > > > > >
Re: [PATCH 2/2] KVM: PPC: booke: Added debug handler
On 07/30/2012 06:11 AM, Bharat Bhushan wrote: > +1: /* debug interrupt happened in guest */ > + mfspr r4, \scratch > + mtcrr3 > + mr r3, r4 > + mfspr r4, SPRN_SPRG_THREAD > + lwz r4, THREAD_KVM_VCPU(r4) > + stw r3, VCPU_GPR(r4)(r4) > + stw r5, VCPU_GPR(r5)(r4) > + stw r6, VCPU_GPR(r6)(r4) You're not working on the latest tree -- all this stuff is "VCPU_GPR(R6)(r4)" now. -Scott -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
On Mon, Jul 30, 2012 at 06:26:31PM -0600, Alex Williamson wrote: > On Tue, 2012-07-31 at 03:01 +0300, Michael S. Tsirkin wrote: > > On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote: > > > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote: > > > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote: > > > > > This new ioctl enables an eventfd to be triggered when an EOI is > > > > > written for a specified irqchip pin. The first user of this will > > > > > be external device assignment through VFIO, using a level irqfd > > > > > for asserting a PCI INTx interrupt and this interface for de-assert > > > > > and notification once the interrupt is serviced. > > > > > > > > > > Here we make use of the reference counting of the _irq_source > > > > > object allowing us to share it with an irqfd and cleanup regardless > > > > > of the release order. > > > > > > > > > > Signed-off-by: Alex Williamson > > > > > > > > > --- > > > > > > > > > > Documentation/virtual/kvm/api.txt | 21 ++ > > > > > arch/x86/kvm/x86.c|2 > > > > > include/linux/kvm.h | 15 ++ > > > > > include/linux/kvm_host.h | 13 + > > > > > virt/kvm/eventfd.c| 336 > > > > > + > > > > > virt/kvm/kvm_main.c | 11 + > > > > > 6 files changed, 398 insertions(+) > > > > > > > > > > diff --git a/Documentation/virtual/kvm/api.txt > > > > > b/Documentation/virtual/kvm/api.txt > > > > > index 3911e62..8cd6b36 100644 > > > > > --- a/Documentation/virtual/kvm/api.txt > > > > > +++ b/Documentation/virtual/kvm/api.txt > > > > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter. > > > > > (If the guest is using > > > > > the virtualized real-mode area (VRMA) facility, the kernel will > > > > > re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.) > > > > > > > > > > +4.77 KVM_EOIFD > > > > > + > > > > > +Capability: KVM_CAP_EOIFD > > > > > +Architectures: x86 > > > > > +Type: vm ioctl > > > > > +Parameters: struct kvm_eoifd (in) > > > > > +Returns: 0 on success, < 0 on error > > > > > + > > > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification > > > > > +through an eventfd. > > > > > > > > I thought about it some more, and I think it should be renamed to an > > > > interrupt ack notification than eoi notification. > > > > For example, consider userspace that uses threaded interrupts. > > > > Currently what will happen is each interrupt will be injected > > > > twice, since on eoi device is still asserting it. > > > > > > I don't follow, why is userspace writing an eoi to the ioapic if it > > > hasn't handled the interrupt > > > > It has handled it - it disabled the hardware interrupt. > > So it's not injected twice, it's held pending at the ioapic the second > time, just like hardware. It is not like hardware at all. in hardware there is no overhead here you interrupot the guest to run handler in host. > Maybe there's a future optimization there, > but I don't think it's appropriate at this time. Yes. But to make it *possible* in future we must remove the requirement to signal fd immediately on EOI. So rename it ackfd. > > > and why wouldn't the same happen on bare > > > metal? > > > > on bare metal level does not matter as long as interrupt > > is disabled. > > > > > > One fix would be to delay event until interrupt is re-enabled. > > > > Now I am not asking you to fix this immediately, > > > > but I think we should make the interface generic by > > > > saying we report an ack to userspace and not specifically EOI. > > > > > > Using the word "delay" in the context of interrupt delivery raises all > > > sorts of red flags for me, but I really don't understand your argument. > > > > I am saying it's an "ack" of interrupt userspace cares about. > > The fact it is done by EOI is an implementation detail. > > The implementation is how an EOI is generated on an ioapic, not that an > EOI exists. How do I read a hardware spec and figure out what "ack of > interrupt" means? It just means it will be called after guest has completed handling interrupt. How we detect that is our problem. > > > > > kvm_eoifd.fd specifies the eventfd used for > > > > > +notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd > > > > > +once assigned. KVM_EOIFD also requires additional bits set in > > > > > +kvm_eoifd.flags to bind to the proper interrupt line. The > > > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided > > > > > +and is a key from a level triggered interrupt (configured from > > > > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound > > > > > +to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key > > > > > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and > > > > > +de-assignment of KVM_EOIFD. A level irqfd may only be bound to a > > > > > +single eoifd. KVM
Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
On Tue, 2012-07-31 at 03:01 +0300, Michael S. Tsirkin wrote: > On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote: > > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote: > > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote: > > > > This new ioctl enables an eventfd to be triggered when an EOI is > > > > written for a specified irqchip pin. The first user of this will > > > > be external device assignment through VFIO, using a level irqfd > > > > for asserting a PCI INTx interrupt and this interface for de-assert > > > > and notification once the interrupt is serviced. > > > > > > > > Here we make use of the reference counting of the _irq_source > > > > object allowing us to share it with an irqfd and cleanup regardless > > > > of the release order. > > > > > > > > Signed-off-by: Alex Williamson > > > > > > > --- > > > > > > > > Documentation/virtual/kvm/api.txt | 21 ++ > > > > arch/x86/kvm/x86.c|2 > > > > include/linux/kvm.h | 15 ++ > > > > include/linux/kvm_host.h | 13 + > > > > virt/kvm/eventfd.c| 336 > > > > + > > > > virt/kvm/kvm_main.c | 11 + > > > > 6 files changed, 398 insertions(+) > > > > > > > > diff --git a/Documentation/virtual/kvm/api.txt > > > > b/Documentation/virtual/kvm/api.txt > > > > index 3911e62..8cd6b36 100644 > > > > --- a/Documentation/virtual/kvm/api.txt > > > > +++ b/Documentation/virtual/kvm/api.txt > > > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter. > > > > (If the guest is using > > > > the virtualized real-mode area (VRMA) facility, the kernel will > > > > re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.) > > > > > > > > +4.77 KVM_EOIFD > > > > + > > > > +Capability: KVM_CAP_EOIFD > > > > +Architectures: x86 > > > > +Type: vm ioctl > > > > +Parameters: struct kvm_eoifd (in) > > > > +Returns: 0 on success, < 0 on error > > > > + > > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification > > > > +through an eventfd. > > > > > > I thought about it some more, and I think it should be renamed to an > > > interrupt ack notification than eoi notification. > > > For example, consider userspace that uses threaded interrupts. > > > Currently what will happen is each interrupt will be injected > > > twice, since on eoi device is still asserting it. > > > > I don't follow, why is userspace writing an eoi to the ioapic if it > > hasn't handled the interrupt > > It has handled it - it disabled the hardware interrupt. So it's not injected twice, it's held pending at the ioapic the second time, just like hardware. Maybe there's a future optimization there, but I don't think it's appropriate at this time. > > and why wouldn't the same happen on bare > > metal? > > on bare metal level does not matter as long as interrupt > is disabled. > > > > One fix would be to delay event until interrupt is re-enabled. > > > Now I am not asking you to fix this immediately, > > > but I think we should make the interface generic by > > > saying we report an ack to userspace and not specifically EOI. > > > > Using the word "delay" in the context of interrupt delivery raises all > > sorts of red flags for me, but I really don't understand your argument. > > I am saying it's an "ack" of interrupt userspace cares about. > The fact it is done by EOI is an implementation detail. The implementation is how an EOI is generated on an ioapic, not that an EOI exists. How do I read a hardware spec and figure out what "ack of interrupt" means? > > > > kvm_eoifd.fd specifies the eventfd used for > > > > +notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd > > > > +once assigned. KVM_EOIFD also requires additional bits set in > > > > +kvm_eoifd.flags to bind to the proper interrupt line. The > > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided > > > > +and is a key from a level triggered interrupt (configured from > > > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound > > > > +to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key > > > > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and > > > > +de-assignment of KVM_EOIFD. A level irqfd may only be bound to a > > > > +single eoifd. KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of > > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD. > > > > > > > > > > Hmm returning the key means we'll need to keep refcounting for source > > > IDs around forever. I liked passing the fd better: make implementation > > > match interface and not the other way around. > > > > False, a source ID has a finite lifecycle. The fd approach was broken. > > Holding the irqfd context imposed too many dependencies between eoifd > > and irqfd necessitating things like one interface disabling another. I > > thoroughly disagree with that approach. > > You keep saying this but it is
Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote: > On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote: > > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote: > > > This new ioctl enables an eventfd to be triggered when an EOI is > > > written for a specified irqchip pin. The first user of this will > > > be external device assignment through VFIO, using a level irqfd > > > for asserting a PCI INTx interrupt and this interface for de-assert > > > and notification once the interrupt is serviced. > > > > > > Here we make use of the reference counting of the _irq_source > > > object allowing us to share it with an irqfd and cleanup regardless > > > of the release order. > > > > > > Signed-off-by: Alex Williamson > > > > > --- > > > > > > Documentation/virtual/kvm/api.txt | 21 ++ > > > arch/x86/kvm/x86.c|2 > > > include/linux/kvm.h | 15 ++ > > > include/linux/kvm_host.h | 13 + > > > virt/kvm/eventfd.c| 336 > > > + > > > virt/kvm/kvm_main.c | 11 + > > > 6 files changed, 398 insertions(+) > > > > > > diff --git a/Documentation/virtual/kvm/api.txt > > > b/Documentation/virtual/kvm/api.txt > > > index 3911e62..8cd6b36 100644 > > > --- a/Documentation/virtual/kvm/api.txt > > > +++ b/Documentation/virtual/kvm/api.txt > > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter. (If > > > the guest is using > > > the virtualized real-mode area (VRMA) facility, the kernel will > > > re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.) > > > > > > +4.77 KVM_EOIFD > > > + > > > +Capability: KVM_CAP_EOIFD > > > +Architectures: x86 > > > +Type: vm ioctl > > > +Parameters: struct kvm_eoifd (in) > > > +Returns: 0 on success, < 0 on error > > > + > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification > > > +through an eventfd. > > > > I thought about it some more, and I think it should be renamed to an > > interrupt ack notification than eoi notification. > > For example, consider userspace that uses threaded interrupts. > > Currently what will happen is each interrupt will be injected > > twice, since on eoi device is still asserting it. > > I don't follow, why is userspace writing an eoi to the ioapic if it > hasn't handled the interrupt It has handled it - it disabled the hardware interrupt. > and why wouldn't the same happen on bare > metal? on bare metal level does not matter as long as interrupt is disabled. > > One fix would be to delay event until interrupt is re-enabled. > > Now I am not asking you to fix this immediately, > > but I think we should make the interface generic by > > saying we report an ack to userspace and not specifically EOI. > > Using the word "delay" in the context of interrupt delivery raises all > sorts of red flags for me, but I really don't understand your argument. I am saying it's an "ack" of interrupt userspace cares about. The fact it is done by EOI is an implementation detail. > > > kvm_eoifd.fd specifies the eventfd used for > > > +notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd > > > +once assigned. KVM_EOIFD also requires additional bits set in > > > +kvm_eoifd.flags to bind to the proper interrupt line. The > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided > > > +and is a key from a level triggered interrupt (configured from > > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound > > > +to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key > > > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and > > > +de-assignment of KVM_EOIFD. A level irqfd may only be bound to a > > > +single eoifd. KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of > > > +KVM_EOIFD_FLAG_LEVEL_IRQFD. > > > > > > > Hmm returning the key means we'll need to keep refcounting for source > > IDs around forever. I liked passing the fd better: make implementation > > match interface and not the other way around. > > False, a source ID has a finite lifecycle. The fd approach was broken. > Holding the irqfd context imposed too many dependencies between eoifd > and irqfd necessitating things like one interface disabling another. I > thoroughly disagree with that approach. You keep saying this but it is still true: once irqfd is closed eoifd does not get any more interrupts. > > > 5. The kvm_run structure > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index 9ded39d..8f3164e 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -2171,6 +2171,8 @@ int kvm_dev_ioctl_check_extension(long ext) > > > case KVM_CAP_PCI_2_3: > > > case KVM_CAP_KVMCLOCK_CTRL: > > > case KVM_CAP_IRQFD_LEVEL: > > > + case KVM_CAP_EOIFD: > > > + case KVM_CAP_EOIFD_LEVEL_IRQFD: > > > r = 1; > > > break; > > > case KVM_CAP_COALES
Re: [PATCH] KVM: Don't update PPR on any APIC read
On Sun, Jul 22, 2012 at 05:41:00PM +0300, Avi Kivity wrote: > The current code will update the PPR on almost any APIC read; however > that's only required if we read the PPR. > > kvm_update_ppr() shows up in some profiles, albeit with a low usage (~1%). > This should reduce it further (it will still be called during interrupt > processing). > > Signed-off-by: Avi Kivity Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6
On Mon, Jul 30, 2012 at 07:39:31PM +0300, Avi Kivity wrote: > On 07/30/2012 05:07 PM, Chris Clayton wrote: > >> > With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15 > clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem > to be the problem. > >>> > >>> Just to be sure, I've run some more tests today. No crashes occurred in > >>> 20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00 > >>> reverted. > >> > >> Ok. I'm trying to reproduce it here on a nested-virt setup, since the > >> code looks correct. > >> > >> What's your preemption settings? > >> > >> > > [chris:~/kernel/linux-3.5.0]$ grep PREEMPT .config > > CONFIG_TREE_PREEMPT_RCU=y > > CONFIG_PREEMPT_RCU=y > > CONFIG_PREEMPT_NOTIFIERS=y > > # CONFIG_PREEMPT_NONE is not set > > # CONFIG_PREEMPT_VOLUNTARY is not set > > CONFIG_PREEMPT=y > > CONFIG_PREEMPT_COUNT=y > > Here's what I think that is happening > > vcpu_load > ... > vmx_save_host_state > vmx_vcpu_run > (ds.cpl, es.cpl cleared by hardware) > > interrupt > push ds, es # pushes bad ds, es > schedule > vmx_vcpu_put > vmx_load_host_state > reload ds, es > pop ds, es # of other thread's stack > iret > # other thread runs > interrupt > schedule # back in vcpu thread > interrupt return: pop ds, es # <-- problem > iret > >... >vcpu_put > ># bad ds, es, but !vmx->host_state.loaded > > Marcelo, did I miss something here? Don't think so. > > Unfortunately, my reproducer has ceased to reproduce. But the fix is > easy if the analysis above is right. > > -- > error compiling committee.c: too many arguments to function > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL (PATCH 0/4)] VFIO driver for v3.6
On Fri, 2012-07-27 at 15:32 +1000, Paul Mackerras wrote: > On Wed, Jul 25, 2012 at 08:53:06AM -0600, Alex Williamson wrote: > > Hi Linus, > > > > This series includes the VFIO userspace driver interface for the > > 3.6 kernel merge window. This driver is intended to provide a > > secure interface for device access using IOMMU protection for > > applications like assignment of physical devices to virtual > > machines. Qemu will be the first user of this interface, enabling > > assignment of PCI devices to Qemu guests. This interface is > > intended to eventually replace the x86-specific assignment mechanism > > currently available in KVM. This interface has the advantage of > > being more secure, by working with IOMMU groups to ensure device > > isolation and providing it's own filtered resource access mechanism, > > and also more flexible, in not being x86 or KVM specific (extensions > > to enable POWER are already working). > > > > As a new driver, I'm including both the individual patches in email, > > as well as a branch to pull from: > > > > git://github.com/awilliam/linux-vfio.git for-linus > > > > This driver is originally the work of Tom Lyon, but has since been > > handed over to me and gone through a complete overhaul thanks to the > > input from David Gibson, Ben Herrenschmidt, Chris Wright, Joerg > > Roedel, and others. This driver has been available in linux-next for > > the last month. Thanks, > > Linus, > > Are you thinking of pulling this driver in for 3.6? I would be glad > to see it go in since we want to use it with KVM on PowerPC. If > possible we'd like the PowerPC bits for it to go in as well. I'm pretty anxious to find out as well. Linus, ping, any thoughts on including this in 3.6? Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: PPC: Book3S_32: Fix MTMSR_EERI macro
Commit b38c77d82e4 moved the MTMSR_EERI macro from the KVM code to generic ppc_asm.h code. However, while adding it in the headers for the ppc32 case, it missed out to remove the former definition in the KVM code. This patch fixes compilation on server type PPC32 targets with CONFIG_KVM enabled. Signed-off-by: Alexander Graf --- arch/powerpc/kvm/book3s_rmhandlers.S |1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/kvm/book3s_rmhandlers.S b/arch/powerpc/kvm/book3s_rmhandlers.S index ab523f3..9ecf6e3 100644 --- a/arch/powerpc/kvm/book3s_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_rmhandlers.S @@ -67,7 +67,6 @@ kvmppc_skip_Hinterrupt: #elif defined(CONFIG_PPC_BOOK3S_32) #define FUNC(name) name -#define MTMSR_EERI(reg)mtmsr (reg) .macro INTERRUPT_TRAMPOLINE intno -- 1.6.0.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu
On Sun, 2012-07-29 at 16:47 +0300, Avi Kivity wrote: > On 07/26/2012 08:40 PM, Alex Williamson wrote: > > On Thu, 2012-07-26 at 19:34 +0300, Avi Kivity wrote: > >> On 07/25/2012 08:03 PM, Alex Williamson wrote: > >> > >> > +/* > >> > + * Resource setup > >> > + */ > >> > +static void vfio_unmap_bar(VFIODevice *vdev, int nr) > >> > +{ > >> > +VFIOBAR *bar = &vdev->bars[nr]; > >> > +uint64_t size; > >> > + > >> > +if (!memory_region_size(&bar->mem)) { > >> > +return; > >> > +} > > > > This one is the "slow" mapped MemoryRegion. If there's nothing here, > > the BAR isn't populated. > > > >> > + > >> > +size = memory_region_size(&bar->mmap_mem); > >> > +if (size) { > >> > + memory_region_del_subregion(&bar->mem, &bar->mmap_mem); > >> > + munmap(bar->mmap, size); > >> > +} > > > > This is the direct mapped MemoryRegion that potentially overlays the > > "slow" mapping above for MMIO BARs of sufficient alignment. If the BAR > > includes the MSI-X vector table, this maps the region in front of the > > table > > If the region size is zero, then both memory_region_del_subregion() > (assuming the region is parented) and munmap() do nothing. So you could > call this unconditionally. I suppose parenting them is the key. I'm counting on memory_region_size of zero for an uninitialized, g_malloc0() MemoryRegion. Initializing them just to have a parent so we can unconditionally remove them here seems like it's just shifting complexity from one function to another. The majority of BARs aren't even implemented, so we'd actually be setting up a lot of dummy infrastructure for a slightly cleaner unmap function. I'll keep looking at this, but I'm not optimistic there's an overall simplification here. > >> > + > >> > +if (vdev->msix && vdev->msix->table_bar == nr) { > >> > +size = memory_region_size(&vdev->msix->mmap_mem); > >> > +if (size) { > >> > +memory_region_del_subregion(&bar->mem, > >> > &vdev->msix->mmap_mem); > >> > +munmap(vdev->msix->mmap, size); > >> > +} > >> > +} > > > > And this one potentially unmaps the overlap after the vector table if > > there's any space for one. > > > >> Are the three size checks needed? Everything should work without them > >> from the memory core point of view. > > > > I haven't tried, but I strongly suspect I shouldn't be munmap'ing > > NULL... no? > > NULL isn't the problem (well some kernels protect against mmaping NULL > to avoid kernel exploits), but it seems the kernel doesn't like a zero > length. in mm/mmap.c:do_munmap() I see: if ((len = PAGE_ALIGN(len)) == 0) return -EINVAL; Before anything scary happens, so that should be ok. It's not really worthwhile to call the munmaps unconditionally if we already have the condition tests because the subregions are unparented though. > >> > + > >> > +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type) > >> > +{ > >> > +VFIOBAR *bar = &vdev->bars[nr]; > >> > +unsigned size = bar->size; > >> > +char name[64]; > >> > + > >> > +sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain, > >> > +vdev->host.bus, vdev->host.slot, vdev->host.function, nr); > >> > + > >> > +/* A "slow" read/write mapping underlies all BARs */ > >> > +memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size); > >> > +pci_register_bar(&vdev->pdev, nr, type, &bar->mem); > >> > >> So far all container BARs have been pure containers, without RAM or I/O > >> callbacks. It should all work, but this sets precedent and requires it > >> to work. I guess there's no problem supporting it though. > > > > KVM device assignment already makes use of this as well, if I understand > > correctly. > > Okay. > > > > >> > + > >> > +if (type & PCI_BASE_ADDRESS_SPACE_IO) { > >> > +return; /* IO space is only slow, don't expect high perf here */ > >> > +} > >> > >> What about non-x86 where IO is actually memory? I think you can drop > >> this and let the address space filtering in the listener drop it if it > >> turns out to be in IO space. > > > > They're probably saying "What's I/O port space?" ;) Yeah, there may be > > some room to do more here, but no need until we have something that can > > make use of it. > > Most likely all that is needed is to drop the test. > > > Note that these are the BAR mappings, which turn into > > MemoryRegions, so I'm not sure what the listener has to do with > > filtering these just yet. > > +static bool vfio_listener_skipped_section(MemoryRegionSection *section) > +{ > +return (section->address_space != get_system_memory() || > +!memory_region_is_ram(section->mr)); > +} > > Or the filter argument to memory_listener_register() (which you use -- > you can drop the first check above). On x86 those I/O regions will be > filtered out, on non-x86 with a properly-wired chipset emulation they'll
[PATCH] tcm_vhost: Post-merge review changes requested by MST
From: Nicholas Bellinger This patch contains the post RFC-v5 (post-merge) changes, this includes: - Add locking comment - Move vhost_scsi_complete_cmd ahead of TFO callbacks in order to drop forward declarations - Drop extra '!= NULL' usage in vhost_scsi_complete_cmd_work() - Change vhost_scsi_*_handle_kick() to use pr_debug - Fix possible race in vhost_scsi_set_endpoint() for vs->vs_tpg checking + assignment. - Convert tv_tpg->tpg_vhost_count + ->tv_tpg_port_count from atomic_t -> int, and make sure reference is protected by ->tv_tpg_mutex. - Drop unnecessary vhost_scsi->vhost_ref_cnt - Add 'err:' label for exception path in vhost_scsi_clear_endpoint() - Add enum for VQ numbers, add usage in vhost_scsi_open() - Add vhost_scsi_flush() + vhost_scsi_flush_vq() following drivers/vhost/net.c - Add smp_wmb() + vhost_scsi_flush() call during vhost_scsi_set_features() - Drop unnecessary copy_from_user() usage with GET_ABI_VERSION ioctl - Add missing vhost_scsi_compat_ioctl() caller for vhost_scsi_fops - Fix function parameter definition first line to follow existing vhost code style - Change 'vHost' usage -> 'vhost' in handful of locations - Change -EPERM -> -EBUSY usage for two failures in tcm_vhost_drop_nexus() - Add comment for tcm_vhost_workqueue in tcm_vhost_init() - Make GET_ABI_VERSION return 'int' + add comment in tcm_vhost.h Reported-by: Michael S. Tsirkin Cc: Michael S. Tsirkin Cc: Stefan Hajnoczi Cc: Anthony Liguori Cc: Zhi Yong Wu Cc: Paolo Bonzini Signed-off-by: Nicholas Bellinger --- drivers/vhost/tcm_vhost.c | 195 drivers/vhost/tcm_vhost.h |9 +- 2 files changed, 111 insertions(+), 93 deletions(-) diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c index 481af88..74b2eda 100644 --- a/drivers/vhost/tcm_vhost.c +++ b/drivers/vhost/tcm_vhost.c @@ -53,9 +53,14 @@ #include "vhost.h" #include "tcm_vhost.h" +enum { + VHOST_SCSI_VQ_CTL = 0, + VHOST_SCSI_VQ_EVT = 1, + VHOST_SCSI_VQ_IO = 2, +}; + struct vhost_scsi { - atomic_t vhost_ref_cnt; - struct tcm_vhost_tpg *vs_tpg; + struct tcm_vhost_tpg *vs_tpg; /* Protected by vhost_scsi->dev.mutex */ struct vhost_dev dev; struct vhost_virtqueue vqs[3]; @@ -131,8 +136,7 @@ static u32 tcm_vhost_get_default_depth(struct se_portal_group *se_tpg) return 1; } -static u32 tcm_vhost_get_pr_transport_id( - struct se_portal_group *se_tpg, +static u32 tcm_vhost_get_pr_transport_id(struct se_portal_group *se_tpg, struct se_node_acl *se_nacl, struct t10_pr_registration *pr_reg, int *format_code, @@ -162,8 +166,7 @@ static u32 tcm_vhost_get_pr_transport_id( format_code, buf); } -static u32 tcm_vhost_get_pr_transport_id_len( - struct se_portal_group *se_tpg, +static u32 tcm_vhost_get_pr_transport_id_len(struct se_portal_group *se_tpg, struct se_node_acl *se_nacl, struct t10_pr_registration *pr_reg, int *format_code) @@ -192,8 +195,7 @@ static u32 tcm_vhost_get_pr_transport_id_len( format_code); } -static char *tcm_vhost_parse_pr_out_transport_id( - struct se_portal_group *se_tpg, +static char *tcm_vhost_parse_pr_out_transport_id(struct se_portal_group *se_tpg, const char *buf, u32 *out_tid_len, char **port_nexus_ptr) @@ -236,8 +238,7 @@ static struct se_node_acl *tcm_vhost_alloc_fabric_acl( return &nacl->se_node_acl; } -static void tcm_vhost_release_fabric_acl( - struct se_portal_group *se_tpg, +static void tcm_vhost_release_fabric_acl(struct se_portal_group *se_tpg, struct se_node_acl *se_nacl) { struct tcm_vhost_nacl *nacl = container_of(se_nacl, @@ -297,7 +298,16 @@ static int tcm_vhost_get_cmd_state(struct se_cmd *se_cmd) return 0; } -static void vhost_scsi_complete_cmd(struct tcm_vhost_cmd *); +static void vhost_scsi_complete_cmd(struct tcm_vhost_cmd *tv_cmd) +{ + struct vhost_scsi *vs = tv_cmd->tvc_vhost; + + spin_lock_bh(&vs->vs_completion_lock); + list_add_tail(&tv_cmd->tvc_completion_list, &vs->vs_completion_list); + spin_unlock_bh(&vs->vs_completion_lock); + + vhost_work_queue(&vs->dev, &vs->vs_completion_work); +} static int tcm_vhost_queue_data_in(struct se_cmd *se_cmd) { @@ -381,7 +391,7 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work) vs_completion_work); struct tcm_vhost_cmd *tv_cmd; - while ((tv_cmd = vhost_scsi_get_cmd_from_completion(vs)) != NULL) { + while ((tv_cmd = vhost_scsi_get_cmd_from_completion(vs))) { struct virtio_scsi_cmd_resp v_rsp; struct se_cmd *se_cmd = &tv_cmd->tvc_se_cmd; int ret; @@ -408,19 +418,6 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work) vhost_signal(&vs->dev, &vs->vqs[2]); } -static void vhost_sc
Re: [Qemu-devel] 9p broken?
On Mon, Jul 30, 2012 at 03:35:39PM +0300, Avi Kivity wrote: > Having an annoying bug on i386 kvm I decided to debug it buy running an > i386 guest on my x86_64 host, use 9p to access a guest image, and run it > using nested kvm. > > However, 9p appears to be broken: first, the configure test fails (patch > sent). Second, while mount works, ls on the mount point causes qemu to > crash with > > (gdb) bt > #0 error_set (errp=0x7fffe95fb128, fmt=0x558d4568 "{ 'class': > 'VirtFSFeatureBlocksMigration', 'data': { 'path': %s, 'tag': %s } }") at > /home/tlv/akivity/qemu/error.c:32 > #1 0x5567cb06 in v9fs_attach (opaque=0x7fffe95e3020) at > /home/tlv/akivity/qemu/hw/9pfs/virtio-9p.c:988 > #2 0x5561d19f in coroutine_trampoline (i0=1449767888, i1=21845) > at /home/tlv/akivity/qemu/coroutine-ucontext.c:138 > #3 0x75a93ef0 in ?? () from /lib64/libc.so.6 > #4 0x7fffce00 in ?? () > #5 0x in ?? ( > > **errp already points to a VirtFSFeatureBlocksMigration error; > v9fs_attach() has been called a second time (the first time, > understandably, on mount; the second on ls). Yes, I can reproduce this too. LIBGUESTFS_QEMU=~/d/qemu/qemu.wrapper \ guestfish -v -- \ sparse /tmp/unused 100M : \ config -device 'virtio-9p-pci,fsdev=root,mount_tag=root' : \ config -fsdev 'local,id=root,path=/tmp,security_model=passthrough' : \ run : \ mount-9p root / : \ ls / Stack trace: #0 0x7fb1d4d19ba5 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:63 #1 0x7fb1d4d1b358 in __GI_abort () at abort.c:90 #2 0x7fb1d4d12972 in __assert_fail_base (fmt= 0x7fb1d4e5c8e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7fb1d8e2e87e "*errp == ((void *)0)", file=file@entry=0x7fb1d8e56217 "error.c", line=line@entry=35, function=function@entry= 0x7fb1d8e2e8ca <__PRETTY_FUNCTION__.13983> "error_set") at assert.c:92 #3 0x7fb1d4d12a22 in __GI___assert_fail (assertion=assertion@entry= 0x7fb1d8e2e87e "*errp == ((void *)0)", file=file@entry= 0x7fb1d8e56217 "error.c", line=line@entry=35, function=function@entry= 0x7fb1d8e2e8ca <__PRETTY_FUNCTION__.13983> "error_set") at assert.c:101 #4 0x7fb1d8c1147f in error_set (errp=errp@entry=0x7fb1ce36a128, fmt=fmt@entry= 0x7fb1d8e39a78 "{ 'class': 'VirtFSFeatureBlocksMigration', 'data': { 'path': %s, 'tag': %s } }") at error.c:35 #5 0x7fb1d8c57f9b in v9fs_attach (opaque=0x7fb1ce352020) at /home/rjones/d/qemu/hw/9pfs/virtio-9p.c:988 #6 0x7fb1d8c0fcfa in coroutine_trampoline (i0=, i1=) at coroutine-ucontext.c:138 #7 0x7fb1d4d2a2f0 in ?? () from /lib64/libc.so.6 #8 0x7fff51061aa0 in ?? () #9 0xb5b5b5b5b5b5b5b5 in ?? () #10 0x in ?? () I'll add a regression test for 9p to libguestfs so at least we will catch this in future during Fedora builds. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://et.redhat.com/~rjones/virt-df/ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: PPC: booke/bookehv: Add guest debug support
On 07/30/2012 02:37 AM, Bhushan Bharat-R65777 wrote: > > >> -Original Message- >> From: Wood Scott-B07421 >> Sent: Friday, July 27, 2012 7:00 AM >> To: Bhushan Bharat-R65777 >> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Bhushan >> Bharat- >> R65777 >> Subject: Re: [PATCH 2/2] KVM: PPC: booke/bookehv: Add guest debug support >> >> On 07/26/2012 12:32 AM, Bharat Bhushan wrote: >>> This patch adds: >>> 1) KVM debug handler added for e500v2. >>> 2) Guest debug by qemu gdb stub. >> >> Does it make sense for these to both be in the same patch? If there's common >> code used by both, that could be added first. > > ok > >> >>> Signed-off-by: Liu Yu >>> Signed-off-by: Varun Sethi >>> [bharat.bhus...@freescale.com: Substantial changes] >>> Signed-off-by: Bharat Bhushan >>> --- >>> arch/powerpc/include/asm/kvm.h| 21 + >>> arch/powerpc/include/asm/kvm_host.h |7 ++ >>> arch/powerpc/include/asm/kvm_ppc.h|2 + >>> arch/powerpc/include/asm/reg_booke.h |1 + >>> arch/powerpc/kernel/asm-offsets.c | 31 ++- >>> arch/powerpc/kvm/booke.c | 146 +++--- >>> arch/powerpc/kvm/booke_interrupts.S | 160 >> - >>> arch/powerpc/kvm/bookehv_interrupts.S | 141 - >>> arch/powerpc/kvm/e500mc.c |3 +- >>> arch/powerpc/kvm/powerpc.c|2 +- >>> 10 files changed, 492 insertions(+), 22 deletions(-) >>> >>> diff --git a/arch/powerpc/include/asm/kvm.h >>> b/arch/powerpc/include/asm/kvm.h index 3c14202..da71c84 100644 >>> --- a/arch/powerpc/include/asm/kvm.h >>> +++ b/arch/powerpc/include/asm/kvm.h >>> @@ -25,6 +25,7 @@ >>> /* Select powerpc specific features in */ #define >>> __KVM_HAVE_SPAPR_TCE #define __KVM_HAVE_PPC_SMT >>> +#define __KVM_HAVE_GUEST_DEBUG >>> >>> struct kvm_regs { >>> __u64 pc; >>> @@ -265,10 +266,19 @@ struct kvm_fpu { }; >>> >>> struct kvm_debug_exit_arch { >>> + __u32 exception; >>> + __u32 pc; >>> + __u32 status; >>> }; >> >> PC must be 64-bit. What goes in "status" and "exception"? > > ok > >> >>> /* for KVM_SET_GUEST_DEBUG */ >>> struct kvm_guest_debug_arch { >>> + struct { >>> + __u64 addr; >>> + __u32 type; >>> + __u32 pad1; >>> + __u64 pad2; >>> + } bp[16]; >>> }; >> >> What goes in "type"? > > Type denote breakpoint, read watchpoint, write watchpoint or watchpoint (both > read and write). Will adding a comment to describe this is ok? Yes, please make sure all of this is well documented. >>> /* definition of registers in kvm_run */ @@ -285,6 +295,17 @@ struct >>> kvm_sync_regs { >>> #define KVM_CPU_3S_64 4 >>> #define KVM_CPU_E500MC 5 >>> >>> +/* Debug related defines */ >>> +#define KVM_INST_GUESTGDB 0x7C00021C /* ehpriv OC=0 */ >> >> Will this work on all PPC? >> >> It certainly won't work on other architectures, so at a minimum it's >> KVM_PPC_INST_GUEST_GDB, but maybe it needs to be determined at runtime. > > How to determine at run time? adding another ioctl ? Or extend an existing one. Is there any other information about debug capabilities that you expose -- number of hardware breakpoints supported, etc? >>> +#define KVM_GUESTDBG_USE_SW_BP 0x0001 >>> +#define KVM_GUESTDBG_USE_HW_BP 0x0002 >> >> Where do these get used? Any reason for these particular values? If you're >> trying to create a partition where the upper half is generic and the lower >> half >> is arch-specific, say so. > > KVM_SET_GUEST_DEBUG ioctl used to set/unset debug interrupts, which > have a "u32 control" element. We have inherited this mechanism from > x86 implementation and it looks like lower 16 bits are generic (like > KVM_GUESTDBG_ENBLE, KVM_GUESTDBG_SINGLESTEP etc and upper 16 bits are > Architecture specific. > > I will add a comment to describe this. I don't think the sw/hw distinction belongs here -- it should be per breakpoint. >>> + run->exit_reason = KVM_EXIT_DEBUG; >>> + run->debug.arch.pc = vcpu->arch.pc; >>> + run->debug.arch.exception = exit_nr; >>> + run->debug.arch.status = 0; >>> + kvmppc_account_exit(vcpu, DEBUG_EXITS); >>> + return RESUME_HOST; >> >> The interface isn't (clearly labelled as) booke specific, but you return >> booke- >> specific exception numbers. How's userspace supposed to know what to do with >> them? What do you plan on doing with them in QEMU? > > This is booke specific. Then put booke in the name, but what about it really needs to be booke specific? Why does QEMU care about the exception type? >>> +#ifndef CONFIG_PPC_FSL_BOOK3E >>> + PPC_LD(r7, VCPU_HOST_DBG+KVMPPC_DBG_IAC3, r4) >>> + PPC_LD(r8, VCPU_HOST_DBG+KVMPPC_DBG_IAC4, r4) >>> + mtspr SPRN_IAC3, r7 >>> + mtspr SPRN_IAC4, r8 >>> +#endif >> >> Can you handle this at runtime with a feature section? >
Qcow2 preallocation + backing file
I've been searching around trying to figure out if there's any benefit to having a qcow2 backing file created with preallocation=metadata set. That is, if there's any benefit to using a qcow2 file that is based on a qcow2 backing file with preallocation=metadata turned on. I know you can't turn on preallocation=metadata for the derived qcow2 file, but nothing seems to stop you from deriving an image from a base image where the base image has preallocation=metadata turned on. Sorry for the tongue twisters. Looking around the net, it's not super clear (or at least I'm not interpreting what I'm finding correctly) but it sounds like it shouldn't have any benefit, i.e. that you lose the benefit of the base image's preallocation=metadata on derived images. But goofing around with fio randrw, running tests on derived images from backing files with preallocation=metadata vs no preallocation=metadata, it seems to have a huge benefit, at least much much higher r/w bw with the preallocation=metadata-derived image. While I might've answered my own question, I'm no fio expert. I don't necessarily trust that I'm testing it correctly, which is why I'm asking here. Running the same test multiple times gives me 30-40meg/sec for both read and write on the preallocation=metadata-derived image, vs 0.5-1meg/sec for non-preallocation=metadata-derived images, both which sound way too high/low to be right. Same box (ubuntu 12.04 64-bit, running the stock ubuntu qemu 1.0), same disk, even the same directory. I could also see some bizarre side effect happening that I'm not aware of, ready to bite me later on down the road, like, say, that it's silently actually reading/writing the original image (just a made-up example). I also figured that it'd be good to ask for future googlers. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
How KVM handles nested guest's ioctl?
I tried to go through KVM code to check how KVM handles nested guest's ioctl calls. Does L0 handles nested guest's ioctl calls or it passes it to L1 guest to handle it? Can anyone please help me with this? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: Check if smp_cpus exceeds max cpus supported by kvm
From: Dunrong Huang Add a helper function for fetching max cpus supported by kvm. Make QEMU exit with an error message if smp_cpus exceeds limit of VCPU count retrieved by invoking this helper function. Signed-off-by: Dunrong Huang --- kvm-all.c | 25 + 1 files changed, 25 insertions(+), 0 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index 2148b20..8cb4b92 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1207,6 +1207,23 @@ static int kvm_irqchip_create(KVMState *s) return 0; } +static int kvm_max_vcpus(KVMState *s) +{ +int max_vcpus = 4; +int ret; +ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); +if (ret) { +max_vcpus = ret; +} else { + ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); +if (ret) { +max_vcpus = ret; +} + } + +return max_vcpus; +} + int kvm_init(void) { static const char upgrade_note[] = @@ -1216,6 +1233,7 @@ int kvm_init(void) const KVMCapabilityInfo *missing_cap; int ret; int i; +int max_vcpus; s = g_malloc0(sizeof(KVMState)); @@ -1256,6 +1274,13 @@ int kvm_init(void) goto err; } +max_vcpus = kvm_max_vcpus(s); +if (smp_cpus > max_vcpus) { +fprintf(stderr, "Number of SMP cpus requested (%d) exceeds max cpus " +"supported by KVM (%d)\n", smp_cpus, max_vcpus); +exit(1); +} + s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0); if (s->vmfd < 0) { #ifdef TARGET_S390X -- 1.7.8.6 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [regression] virtio net locks up
On 07/30/2012 07:33 PM, Bernd Schubert wrote: Hello Stefan, Stefan Hajnoczi gmail.com> writes: On Wed, Jan 11, 2012 at 4:18 PM, Bernd Schubert itwm.fraunhofer.de> wrote: On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote: Try pinging the host's IP address from inside the guest. Run tcpdump on the guest's tap interface from the host and observe whether or not you see any packets being sent from the guest. sorry for my terribly late reply. As usual I got distracted by too many other things and then returned the hardware I was running the VMs on. My new desktop system is better suitable to run kvm and I can easily reproduce it now with 3.5 on host and guest side. So its not fixed in recent versions yet. Seems arp requests are still going out, but then don't go in: 17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui Unknown), length 28 17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28 17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui Unknown), length 28 17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, length 28 Okay, so it seems networking from the tap device and beyond is fine. rmmod virtio_net inside the guest and then modprobe virtio_net again. See if network connectivity is restored (remember to rerun DHCP or whatever, if necessary). Yep, that makes it work again. But probably is not the real solution ;) It's just another piece of information which helps debug this :). At least nothing has wedged itself into an unrecoverable state. When you said the problem happens without vhost, did you explicitly run vhost=off? Or did you just omit "vhost=on"? It was definitely off and I can confirm that it also locks up with vhost=on and vhost=off with 3.5. This sounds like a guest kernel/driver issue. I recommend testing git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git in the guest to see if this has already been fixed. If you have the -dbg RPMs installed it may be possible to insert a probe into the virtio_net kernel module and observe receive interrupts. This does require the right kernel CONFIG_ but you might already have it enabled: $ sudo perf probe --add skb_recv_done $ sudo perf record -e probe:skb_recv_done -a ...send some packets to the guest... ^C $ sudo perf script If you see no skb_recv_done events then the guest driver is not receiving a notification when packets are received. You can find more about how to use perf-probe(1) at http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html. Ah nice, I would have used systemtap, but always wanted to check how to do it with perf :) So once the virtio NIC has locked up, I don't get any events from it anymore - until I remove/re-insert the virtio module (including ifup/ifdown). I will try to find some time later on this week to look into it again. Any further ideas how to proceed (I haven't even checked yet how virtio works at all...). I took a quick glance where skb_recv_done is registered at all and traced it back to vp_find_vqs(). Looking into that function I noticed MSI and so tried to boot with pci=nomsi. And indeed I guessed it right, with pci=nomsi I don't get any lockups anymore. Am I the only one booting kvm-qemu usually with enabled MSI? Cheers, Bernd -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [regression] virtio net locks up
Hello Stefan, Stefan Hajnoczi gmail.com> writes: > > On Wed, Jan 11, 2012 at 4:18 PM, Bernd Schubert > itwm.fraunhofer.de> wrote: > > On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote: > >> Try pinging the host's IP address from inside the guest. Run tcpdump > >> on the guest's tap interface from the host and observe whether or not > >> you see any packets being sent from the guest. > > sorry for my terribly late reply. As usual I got distracted by too many other things and then returned the hardware I was running the VMs on. My new desktop system is better suitable to run kvm and I can easily reproduce it now with 3.5 on host and guest side. So its not fixed in recent versions yet. > > > > Seems arp requests are still going out, but then don't go in: > > > > 17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui > > Unknown), length 28 > > 17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28 > > 17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui Unknown), > > length 28 > > 17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, length 28 > > Okay, so it seems networking from the tap device and beyond is fine. > > >> rmmod virtio_net inside the guest and then modprobe virtio_net again. > >> See if network connectivity is restored (remember to rerun DHCP or > >> whatever, if necessary). > > > > > > Yep, that makes it work again. But probably is not the real solution ;) > > It's just another piece of information which helps debug this :). At > least nothing has wedged itself into an unrecoverable state. > > When you said the problem happens without vhost, did you explicitly > run vhost=off? Or did you just omit "vhost=on"? It was definitely off and I can confirm that it also locks up with vhost=on and vhost=off with 3.5. > > This sounds like a guest kernel/driver issue. I recommend testing > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git in > the guest to see if this has already been fixed. > > If you have the -dbg RPMs installed it may be possible to insert a > probe into the virtio_net kernel module and observe receive > interrupts. This does require the right kernel CONFIG_ but you might > already have it enabled: > > $ sudo perf probe --add skb_recv_done > $ sudo perf record -e probe:skb_recv_done -a > ...send some packets to the guest... > ^C > $ sudo perf script > > If you see no skb_recv_done events then the guest driver is not > receiving a notification when packets are received. > > You can find more about how to use perf-probe(1) at > http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html. Ah nice, I would have used systemtap, but always wanted to check how to do it with perf :) So once the virtio NIC has locked up, I don't get any events from it anymore - until I remove/re-insert the virtio module (including ifup/ifdown). I will try to find some time later on this week to look into it again. Any further ideas how to proceed (I haven't even checked yet how virtio works at all...). Thanks, Bernd -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM: MMU: Tracking guest writes through EPT entries ?
Hello List, I am a KVM newbie and studying KVM mmu code. On the existing guest, I am trying to track all guest writes by marking page table entry as read-only in EPT entry [ I am using Intel machine with vmx and ept support ]. Looks like EPT support re-uses shadow page table(SPT) code and hence some of SPT routines. I was thinking of below possible approach. Use pte_list_walk() to traverse through list of sptes and use mmu_spte_update() to flip the PT_WRITABLE_MASK flag. But all SPTEs are not part of any single list; but on separate lists (based on gfn, page level, memory_slot). So, recording all the faulted guest GFN and then using above method work ? Can you please help if there is more direct and cleaner way to do it ? (or if I am totally off the track) Thanks ! -- Sunil -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6
On 07/30/2012 05:07 PM, Chris Clayton wrote: >> With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15 clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem to be the problem. >>> >>> Just to be sure, I've run some more tests today. No crashes occurred in >>> 20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00 >>> reverted. >> >> Ok. I'm trying to reproduce it here on a nested-virt setup, since the >> code looks correct. >> >> What's your preemption settings? >> >> > [chris:~/kernel/linux-3.5.0]$ grep PREEMPT .config > CONFIG_TREE_PREEMPT_RCU=y > CONFIG_PREEMPT_RCU=y > CONFIG_PREEMPT_NOTIFIERS=y > # CONFIG_PREEMPT_NONE is not set > # CONFIG_PREEMPT_VOLUNTARY is not set > CONFIG_PREEMPT=y > CONFIG_PREEMPT_COUNT=y Here's what I think that is happening vcpu_load ... vmx_save_host_state vmx_vcpu_run (ds.cpl, es.cpl cleared by hardware) interrupt push ds, es # pushes bad ds, es schedule vmx_vcpu_put vmx_load_host_state reload ds, es pop ds, es # of other thread's stack iret # other thread runs interrupt schedule # back in vcpu thread interrupt return: pop ds, es # <-- problem iret ... vcpu_put # bad ds, es, but !vmx->host_state.loaded Marcelo, did I miss something here? Unfortunately, my reproducer has ceased to reproduce. But the fix is easy if the analysis above is right. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote: > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote: > > This new ioctl enables an eventfd to be triggered when an EOI is > > written for a specified irqchip pin. The first user of this will > > be external device assignment through VFIO, using a level irqfd > > for asserting a PCI INTx interrupt and this interface for de-assert > > and notification once the interrupt is serviced. > > > > Here we make use of the reference counting of the _irq_source > > object allowing us to share it with an irqfd and cleanup regardless > > of the release order. > > > > Signed-off-by: Alex Williamson > > > --- > > > > Documentation/virtual/kvm/api.txt | 21 ++ > > arch/x86/kvm/x86.c|2 > > include/linux/kvm.h | 15 ++ > > include/linux/kvm_host.h | 13 + > > virt/kvm/eventfd.c| 336 > > + > > virt/kvm/kvm_main.c | 11 + > > 6 files changed, 398 insertions(+) > > > > diff --git a/Documentation/virtual/kvm/api.txt > > b/Documentation/virtual/kvm/api.txt > > index 3911e62..8cd6b36 100644 > > --- a/Documentation/virtual/kvm/api.txt > > +++ b/Documentation/virtual/kvm/api.txt > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter. (If > > the guest is using > > the virtualized real-mode area (VRMA) facility, the kernel will > > re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.) > > > > +4.77 KVM_EOIFD > > + > > +Capability: KVM_CAP_EOIFD > > +Architectures: x86 > > +Type: vm ioctl > > +Parameters: struct kvm_eoifd (in) > > +Returns: 0 on success, < 0 on error > > + > > +KVM_EOIFD allows userspace to receive interrupt EOI notification > > +through an eventfd. > > I thought about it some more, and I think it should be renamed to an > interrupt ack notification than eoi notification. > For example, consider userspace that uses threaded interrupts. > Currently what will happen is each interrupt will be injected > twice, since on eoi device is still asserting it. I don't follow, why is userspace writing an eoi to the ioapic if it hasn't handled the interrupt and why wouldn't the same happen on bare metal? > One fix would be to delay event until interrupt is re-enabled. > Now I am not asking you to fix this immediately, > but I think we should make the interface generic by > saying we report an ack to userspace and not specifically EOI. Using the word "delay" in the context of interrupt delivery raises all sorts of red flags for me, but I really don't understand your argument. > > kvm_eoifd.fd specifies the eventfd used for > > +notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd > > +once assigned. KVM_EOIFD also requires additional bits set in > > +kvm_eoifd.flags to bind to the proper interrupt line. The > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided > > +and is a key from a level triggered interrupt (configured from > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound > > +to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key > > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and > > +de-assignment of KVM_EOIFD. A level irqfd may only be bound to a > > +single eoifd. KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of > > +KVM_EOIFD_FLAG_LEVEL_IRQFD. > > > > Hmm returning the key means we'll need to keep refcounting for source > IDs around forever. I liked passing the fd better: make implementation > match interface and not the other way around. False, a source ID has a finite lifecycle. The fd approach was broken. Holding the irqfd context imposed too many dependencies between eoifd and irqfd necessitating things like one interface disabling another. I thoroughly disagree with that approach. > > 5. The kvm_run structure > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 9ded39d..8f3164e 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -2171,6 +2171,8 @@ int kvm_dev_ioctl_check_extension(long ext) > > case KVM_CAP_PCI_2_3: > > case KVM_CAP_KVMCLOCK_CTRL: > > case KVM_CAP_IRQFD_LEVEL: > > + case KVM_CAP_EOIFD: > > + case KVM_CAP_EOIFD_LEVEL_IRQFD: > > r = 1; > > break; > > case KVM_CAP_COALESCED_MMIO: > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h > > index b2e6e4f..effb916 100644 > > --- a/include/linux/kvm.h > > +++ b/include/linux/kvm.h > > @@ -619,6 +619,8 @@ struct kvm_ppc_smmu_info { > > #define KVM_CAP_S390_COW 79 > > #define KVM_CAP_PPC_ALLOC_HTAB 80 > > #define KVM_CAP_IRQFD_LEVEL 81 > > +#define KVM_CAP_EOIFD 82 > > +#define KVM_CAP_EOIFD_LEVEL_IRQFD 83 > > > > #ifdef KVM_CAP_IRQ_ROUTING > > > > @@ -694,6 +696,17 @@ struct kvm_irqfd { > > __u8 pad[20]; > > }; > > > > +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0) > > +/* Available with KVM_CAP_EOIFD_
Re: [PATCH v7 1/2] kvm: Extend irqfd to support level interrupts
On Sun, 2012-07-29 at 18:01 +0300, Michael S. Tsirkin wrote: > On Tue, Jul 24, 2012 at 02:43:14PM -0600, Alex Williamson wrote: > > In order to inject a level interrupt from an external source using an > > irqfd, we need to allocate a new irq_source_id. This allows us to > > assert and (later) de-assert an interrupt line independently from > > users of KVM_IRQ_LINE and avoid lost interrupts. > > > > We also add what may appear like a bit of excessive infrastructure > > around an object for storing this irq_source_id. However, notice > > that we only provide a way to assert the interrupt here. A follow-on > > interface will make use of the same irq_source_id to allow de-assert. > > > > Signed-off-by: Alex Williamson > > I think this tracking of source ids is the root of all the problems > you see with this patchset. > > A source ID is required for an irqfd to be created. > But if source ID exists after irqfd is destroyed then > the next create will fail. Only if there are no available source IDs. > So the only sane thing to do is to make irqfd manage this resource, > clean it up completely when irqfd is gone. > > Not to mention, the patch will be smaller :) The only sane way to do that is to pull the eoifd into KVM_IRQFD and set them up together. That's actually what v1 of this endeavor did. My intention with splitting eoifd from irqfd is that I think EOI notification is potentially useful outside of this usage with irqfds and I wanted an interface that could be used independently. Someday, an irqfd may not be the only way to generate a key. Userspace may also wish to register to receive notification-only for the existing user source ID. I do not think it's sane to have an eoifd configured using KVM_EOIFD and destroyed using KVM_IRQFD. As for smaller patch, I'm not convinced. We still have to watch for POLLHUP, which pulls in the bulk of the code. And using the above approach of pulling eoifd setup into irqfd we have to address what happens to the combined set when either eventfd is closed. By your argument closing the irqfd closes the eoifd, but does closing the eoifd necessarily close the irqfd. If not then we end up with the question of how can an eoifd be added to an existing irqfd. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call minutes July 24th
Juan Quintela writes: > Hi > > Minutes of Today call. > > ahci: agraf > > - how to enable it? > * today is too difficult (alex) > * get it to work as everything else (markus) > And big discussion ensued. Markus & alex will reply with details. > No agreement, and too subtle for me to resume perfectly. > * cdrom still not working > - dates for freeze? q35? > depends on what people does/review? > > Later, Juan. Sorry for the delay. Hope I still remember enough write a worthwhile summary. Alex, please correct me if I misrepresent you. Background -- * -drive if=ide asks the board to connect this drive to its "IDE controller". It's silently ignored if the board doesn't have one. Even if it has one, it may silently ignore some drives. Examples: - PIIX3 boards such as current "pc" connect all if=ide drives to controller "piix3-ide". Bus numbers > 2 fail machine creation. - I expect the forthcoming "q35" board to connect all if=ide drives to controller "ich9-ahci". We hope to upgrade "pc" to this board. - Board "r2d" connects the if=ide,index=0 drive to a non-qdevified controller hw/ide/mmio.c. Any others get ignored. - Board "s390-virtio" connects some if=ide drives[*] to virtio-blk-s390. Yes, that's not really an IDE controller. - Board "highbank" appears to have an ich9-ahci controller, but it doesn't seem to connect any if=ide drives to it. * Likewise, -drive if=scsi asks the board to connect this drive to a SCSI controller. Examples: - PC boards create N+1 "lsi53c895a" controllers, where N is the largest bus number in any -drive if=scsi. The controller auto-connects all if=scsi,bus=I drives for its own bus number I. Ugly as hell. - Board "pseries" works the same, except it creates "spapr-vscsi" controllers. - The various Sun4[cdm] boards create a single "esp" controller, for bus=0. -drive if=scsi with any other bus fails machine initialization. * The meaning of -hda, drive without if=, and so forth depends on the board, it's currently either like if=scsi or like if=ide. * You can use -device to add additional controllers and drives. For instance, if the board provides a PCI bus, then -device ich9-ahci,id=ahci0 creates an ICH9 AHCI controller, and -drive if=none,id=drive0,... -device ide-hd,bus=ahci0.0,drive=drive0 connects an IDE disk to it. Alex's proposal --- Alex wants more users for AHCI. Using it with a board that doesn't connect if=ide drives to an AHCI controller involves -device. Alex thinks that's too hard for users, and prevents adoption. His proposed solution is to create a new interface type "ahci", so that -drive if=ahci "just works". It creates ich9-ahci controllers automatically. Critique * Does the use case "AHCI with a non-AHCI board" justify a new interface type? Would anybody care if we had q35 today? Would we regret adding if=ahci once we have q35? * If the board already has an ich9-ahci, you can't use if=ahci in its current form to add drives to it. if=ahci creates *additional* ich9-ahci controllers. * Do we want a separate interface type for each (sufficiently popular) controller? What about "megasas" when the board provides only "lsi53c895a"? What shall we do when we acquire an ich10 device model? Where does that stop? * Should -drive ever auto-create controllers? if=scsi does, but it's ugly. * A possible alternative: machine option to replace the controller to use for if=ide. Also possible for if=scsi. * Probably more I can't remember anymore. [*] "Some drives": looks like the code attempts to connect up to ten, but I'm not at all sure more than two work. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv5 1/4] Provide userspace IO exit completion callback.
Current code assumes that IO exit was due to instruction emulation and handles execution back to emulator directly. This patch adds new userspace IO exit completion callback that can be set by any other code that caused IO exit to userspace. Signed-off-by: Gleb Natapov --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/x86.c | 92 +++ 2 files changed, 56 insertions(+), 37 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 48e7131..37f4f11 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -414,6 +414,7 @@ struct kvm_vcpu_arch { struct x86_emulate_ctxt emulate_ctxt; bool emulate_regs_need_sync_to_vcpu; bool emulate_regs_need_sync_from_vcpu; + int (*complete_userspace_io)(struct kvm_vcpu *vcpu); gpa_t time; struct pvclock_vcpu_time_info hv_clock; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b6379e5..41c0c3d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4539,6 +4539,9 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt, return true; } +static int complete_emulated_mmio(struct kvm_vcpu *vcpu); +static int complete_emulated_pio(struct kvm_vcpu *vcpu); + int x86_emulate_instruction(struct kvm_vcpu *vcpu, unsigned long cr2, int emulation_type, @@ -4609,13 +4612,16 @@ restart: } else if (vcpu->arch.pio.count) { if (!vcpu->arch.pio.in) vcpu->arch.pio.count = 0; - else + else { writeback = false; + vcpu->arch.complete_userspace_io = complete_emulated_pio; + } r = EMULATE_DO_MMIO; } else if (vcpu->mmio_needed) { if (!vcpu->mmio_is_write) writeback = false; r = EMULATE_DO_MMIO; + vcpu->arch.complete_userspace_io = complete_emulated_mmio; } else if (r == EMULATION_RESTART) goto restart; else @@ -5471,6 +5477,24 @@ static int __vcpu_run(struct kvm_vcpu *vcpu) return r; } +static inline int complete_emulated_io(struct kvm_vcpu *vcpu) +{ + int r; + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); + r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE); + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx); + if (r != EMULATE_DONE) + return 0; + return 1; +} + +static int complete_emulated_pio(struct kvm_vcpu *vcpu) +{ + BUG_ON(!vcpu->arch.pio.count); + + return complete_emulated_io(vcpu); +} + /* * Implements the following, as a state machine: * @@ -5487,47 +5511,37 @@ static int __vcpu_run(struct kvm_vcpu *vcpu) * copy data * exit */ -static int complete_mmio(struct kvm_vcpu *vcpu) +static int complete_emulated_mmio(struct kvm_vcpu *vcpu) { struct kvm_run *run = vcpu->run; struct kvm_mmio_fragment *frag; - int r; - if (!(vcpu->arch.pio.count || vcpu->mmio_needed)) - return 1; + BUG_ON(!vcpu->mmio_needed); - if (vcpu->mmio_needed) { - /* Complete previous fragment */ - frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment++]; - if (!vcpu->mmio_is_write) - memcpy(frag->data, run->mmio.data, frag->len); - if (vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments) { - vcpu->mmio_needed = 0; - if (vcpu->mmio_is_write) - return 1; - vcpu->mmio_read_completed = 1; - goto done; - } - /* Initiate next fragment */ - ++frag; - run->exit_reason = KVM_EXIT_MMIO; - run->mmio.phys_addr = frag->gpa; + /* Complete previous fragment */ + frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment++]; + if (!vcpu->mmio_is_write) + memcpy(frag->data, run->mmio.data, frag->len); + if (vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments) { + vcpu->mmio_needed = 0; if (vcpu->mmio_is_write) - memcpy(run->mmio.data, frag->data, frag->len); - run->mmio.len = frag->len; - run->mmio.is_write = vcpu->mmio_is_write; - return 0; - - } -done: - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); - r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE); - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx); - if (r != EMULATE_DONE) - return 0; - return 1; + return 1; + vcpu->mmio_read_completed = 1; + return complete_emulated_io(vcpu); + } + /* Initiate next fragment */
[PATCHv5 4/4] KVM: emulator: optimize "rep ins" handling.
Optimize "rep ins" by allowing emulator to write back more than one datum at a time. Introduce new operand type OP_MEM_STR which tells writeback() that dst contains pointer to an array that should be written back as opposite to just one data element. Signed-off-by: Gleb Natapov --- arch/x86/include/asm/kvm_emulate.h |4 +++- arch/x86/kvm/emulate.c | 33 - 2 files changed, 31 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 8d0fe8f..d1777f8 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -200,8 +200,9 @@ typedef u32 __attribute__((vector_size(16))) sse128_t; /* Type, address-of, and value of an instruction's operand. */ struct operand { - enum { OP_REG, OP_MEM, OP_IMM, OP_XMM, OP_MM, OP_NONE } type; + enum { OP_REG, OP_MEM, OP_MEM_STR, OP_IMM, OP_XMM, OP_MM, OP_NONE } type; unsigned int bytes; + unsigned int count; union { unsigned long orig_val; u64 orig_val64; @@ -221,6 +222,7 @@ struct operand { char valptr[sizeof(unsigned long) + 2]; sse128_t vec_val; u64 mm_val; + void *data; }; }; diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c22762d..c74bce8 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1251,8 +1251,15 @@ static int pio_in_emulated(struct x86_emulate_ctxt *ctxt, rc->end = n * size; } - memcpy(dest, rc->data + rc->pos, size); - rc->pos += size; + if (ctxt->rep_prefix && !(ctxt->eflags & EFLG_DF)) { + ctxt->dst.data = rc->data + rc->pos; + ctxt->dst.type = OP_MEM_STR; + ctxt->dst.count = (rc->end - rc->pos) / size; + rc->pos = rc->end; + } else { + memcpy(dest, rc->data + rc->pos, size); + rc->pos += size; + } return 1; } @@ -1500,6 +1507,14 @@ static int writeback(struct x86_emulate_ctxt *ctxt) if (rc != X86EMUL_CONTINUE) return rc; break; + case OP_MEM_STR: + rc = segmented_write(ctxt, + ctxt->dst.addr.mem, + ctxt->dst.data, + ctxt->dst.bytes * ctxt->dst.count); + if (rc != X86EMUL_CONTINUE) + return rc; + break; case OP_XMM: write_sse_reg(ctxt, &ctxt->dst.vec_val, ctxt->dst.addr.xmm); break; @@ -2732,7 +2747,7 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt, static void string_addr_inc(struct x86_emulate_ctxt *ctxt, int reg, struct operand *op) { - int df = (ctxt->eflags & EFLG_DF) ? -1 : 1; + int df = (ctxt->eflags & EFLG_DF) ? -op->count : op->count; register_address_increment(ctxt, &ctxt->regs[reg], df * op->bytes); op->addr.mem.ea = register_address(ctxt, ctxt->regs[reg]); @@ -3672,7 +3687,7 @@ static struct opcode opcode_table[256] = { I(DstReg | SrcMem | ModRM | Src2Imm, em_imul_3op), I(SrcImmByte | Mov | Stack, em_push), I(DstReg | SrcMem | ModRM | Src2ImmByte, em_imul_3op), - I2bvIP(DstDI | SrcDX | Mov | String, em_in, ins, check_perm_in), /* insb, insw/insd */ + I2bvIP(DstDI | SrcDX | Mov | String | Unaligned, em_in, ins, check_perm_in), /* insb, insw/insd */ I2bvIP(SrcSI | DstDX | String, em_out, outs, check_perm_out), /* outsb, outsw/outsd */ /* 0x70 - 0x7F */ X16(D(SrcImmByte)), @@ -3930,6 +3945,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op, register_address(ctxt, ctxt->regs[VCPU_REGS_RDI]); op->addr.mem.seg = VCPU_SREG_ES; op->val = 0; + op->count = 1; break; case OpDX: op->type = OP_REG; @@ -3973,6 +3989,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op, register_address(ctxt, ctxt->regs[VCPU_REGS_RSI]); op->addr.mem.seg = seg_override(ctxt); op->val = 0; + op->count = 1; break; case OpImmFAddr: op->type = OP_IMM; @@ -4513,8 +4530,14 @@ writeback: string_addr_inc(ctxt, VCPU_REGS_RDI, &ctxt->dst); if (ctxt->rep_prefix && (ctxt->d & String)) { + unsigned int count; struct read_cache *r = &ctxt->io_read; - register_address_increment(ctxt, &ctxt->regs[VCPU_REGS_RCX], -1); + if ((ctxt->d & SrcMask) == SrcSI) + count = ctxt->src.count; + else + count = ctxt->dst.count; + register_addre
[PATCHv5 3/4] KVM: emulator: string_addr_inc() cleanup
Remove unneeded segment argument. Address structure already has correct segment which was put there during decode. Signed-off-by: Gleb Natapov --- arch/x86/kvm/emulate.c | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 4d33423..c22762d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2729,14 +2729,13 @@ int emulator_task_switch(struct x86_emulate_ctxt *ctxt, return (rc == X86EMUL_UNHANDLEABLE) ? EMULATION_FAILED : EMULATION_OK; } -static void string_addr_inc(struct x86_emulate_ctxt *ctxt, unsigned seg, - int reg, struct operand *op) +static void string_addr_inc(struct x86_emulate_ctxt *ctxt, int reg, + struct operand *op) { int df = (ctxt->eflags & EFLG_DF) ? -1 : 1; register_address_increment(ctxt, &ctxt->regs[reg], df * op->bytes); op->addr.mem.ea = register_address(ctxt, ctxt->regs[reg]); - op->addr.mem.seg = seg; } static int em_das(struct x86_emulate_ctxt *ctxt) @@ -4508,12 +4507,10 @@ writeback: ctxt->dst.type = saved_dst_type; if ((ctxt->d & SrcMask) == SrcSI) - string_addr_inc(ctxt, seg_override(ctxt), - VCPU_REGS_RSI, &ctxt->src); + string_addr_inc(ctxt, VCPU_REGS_RSI, &ctxt->src); if ((ctxt->d & DstMask) == DstDI) - string_addr_inc(ctxt, VCPU_SREG_ES, VCPU_REGS_RDI, - &ctxt->dst); + string_addr_inc(ctxt, VCPU_REGS_RDI, &ctxt->dst); if (ctxt->rep_prefix && (ctxt->d & String)) { struct read_cache *r = &ctxt->io_read; -- 1.7.10 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv5 2/4] KVM: emulator: make x86 emulation modes enum instead of defines
Signed-off-by: Gleb Natapov --- arch/x86/include/asm/kvm_emulate.h | 22 ++ arch/x86/kvm/emulate.c |4 +++- 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index c764f43..8d0fe8f 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -236,6 +236,15 @@ struct read_cache { unsigned long end; }; +/* Execution mode, passed to the emulator. */ +enum x86emul_mode { + X86EMUL_MODE_REAL, /* Real mode. */ + X86EMUL_MODE_VM86, /* Virtual 8086 mode. */ + X86EMUL_MODE_PROT16,/* 16-bit protected mode. */ + X86EMUL_MODE_PROT32,/* 32-bit protected mode. */ + X86EMUL_MODE_PROT64,/* 64-bit (long) mode.*/ +}; + struct x86_emulate_ctxt { struct x86_emulate_ops *ops; @@ -243,7 +252,7 @@ struct x86_emulate_ctxt { unsigned long eflags; unsigned long eip; /* eip before instruction emulation */ /* Emulated execution mode, represented by an X86EMUL_MODE value. */ - int mode; + enum x86emul_mode mode; /* interruptibility state, as a result of execution of STI or MOV SS */ int interruptibility; @@ -293,17 +302,6 @@ struct x86_emulate_ctxt { #define REPE_PREFIX0xf3 #define REPNE_PREFIX 0xf2 -/* Execution mode, passed to the emulator. */ -#define X86EMUL_MODE_REAL 0/* Real mode. */ -#define X86EMUL_MODE_VM86 1/* Virtual 8086 mode. */ -#define X86EMUL_MODE_PROT16 2/* 16-bit protected mode. */ -#define X86EMUL_MODE_PROT32 4/* 32-bit protected mode. */ -#define X86EMUL_MODE_PROT64 8/* 64-bit (long) mode.*/ - -/* any protected mode */ -#define X86EMUL_MODE_PROT (X86EMUL_MODE_PROT16|X86EMUL_MODE_PROT32| \ - X86EMUL_MODE_PROT64) - /* CPUID vendors */ #define X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx 0x68747541 #define X86EMUL_CPUID_VENDOR_AuthenticAMD_ecx 0x444d4163 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 10f0136..4d33423 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2210,6 +2210,8 @@ static int em_sysenter(struct x86_emulate_ctxt *ctxt) if (msr_data == 0x0) return emulate_gp(ctxt, 0); break; + default: + break; } ctxt->eflags &= ~(EFLG_VM | EFLG_IF | EFLG_RF); @@ -4338,7 +4340,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) } /* Instruction can only be executed in protected mode */ - if ((ctxt->d & Prot) && !(ctxt->mode & X86EMUL_MODE_PROT)) { + if ((ctxt->d & Prot) && ctxt->mode < X86EMUL_MODE_PROT16) { rc = emulate_ud(ctxt); goto done; } -- 1.7.10 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv5 0/4] improve speed of "rep ins" emulation
And now for something completely different. So this series (or rather the last patch of it) takes different approach to "rep ins" optimization. Instead of writing separate fast path for it do the fast path inside emulator itself. This way nobody can say the code is not reused! Patch 1,2 are now, strictly speaking, not needed, but I think this is still nice cleanup and, in case of patch 1, eliminates some ifs at each KVM_RUN ioctl. Gleb Natapov (4): Provide userspace IO exit completion callback. KVM: emulator: make x86 emulation modes enum instead of defines KVM: emulator: string_addr_inc() cleanup KVM: emulator: optimize "rep ins" handling. arch/x86/include/asm/kvm_emulate.h | 26 +- arch/x86/include/asm/kvm_host.h|1 + arch/x86/kvm/emulate.c | 48 ++- arch/x86/kvm/x86.c | 92 +--- 4 files changed, 104 insertions(+), 63 deletions(-) -- 1.7.10 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6
On 07/30/12 15:03, Avi Kivity wrote: On 07/30/2012 05:00 PM, Chris Clayton wrote: On 07/29/12 20:10, Chris Clayton wrote: Possible culprit: b2da15ac26a0c00. That commit isn't in qermu-kvm-1.1.1. It is in kernel. Sorry, so it is. With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15 clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem to be the problem. Just to be sure, I've run some more tests today. No crashes occurred in 20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00 reverted. Ok. I'm trying to reproduce it here on a nested-virt setup, since the code looks correct. What's your preemption settings? [chris:~/kernel/linux-3.5.0]$ grep PREEMPT .config CONFIG_TREE_PREEMPT_RCU=y CONFIG_PREEMPT_RCU=y CONFIG_PREEMPT_NOTIFIERS=y # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_COUNT=y -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6
On 07/30/2012 05:00 PM, Chris Clayton wrote: > On 07/29/12 20:10, Chris Clayton wrote: > Possible culprit: b2da15ac26a0c00. > > That commit isn't in qermu-kvm-1.1.1. >>> It is in kernel. >>> >> >> Sorry, so it is. >> >> With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15 >> clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem >> to be the problem. > > Just to be sure, I've run some more tests today. No crashes occurred in > 20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00 > reverted. Ok. I'm trying to reproduce it here on a nested-virt setup, since the code looks correct. What's your preemption settings? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm-1.1.0 crashing with kernel 3.5.0-rc6
On 07/29/12 20:10, Chris Clayton wrote: Possible culprit: b2da15ac26a0c00. That commit isn't in qermu-kvm-1.1.1. It is in kernel. Sorry, so it is. With kernel 3.5.0 with b2da15ac26a0c00 reverted, I have just had 15 clean invocations of vanilla qemu-kvm-1.1.1. So that commit would seem to be the problem. Just to be sure, I've run some more tests today. No crashes occurred in 20 runs of vanilla qemu-kvm-1.1.1 on kernel 3.5.0 with b2da15ac26a0c00 reverted. Thanks. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V4 3/3] virtio-blk: Add bio-based IO path for virtio-blk
On Mon, Jul 30, 2012 at 11:25:51AM +0930, Rusty Russell wrote: > I consider this approach a half-way step. Quick attempts on my laptop > and I couldn't find a case where the bio path was a loss, but in theory > if the host wasn't doing any reordering and it was a slow device, you'd > want the guest to do so. > > I'm not sure if current qemu can be configured to do such a thing? The host kernel will do the I/O scheduling for you unless you explicitly disable it. And we should be able to assume an administrator will only disable it when they have a reason for it - if not they'll get worse performance for non-virtualized workloads as well. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 3/3] virtio-blk: Add bio-based IO path for virtio-blk
On Mon, Jul 30, 2012 at 12:43:12PM +0800, Asias He wrote: > I think we can add REQ_FLUSH & REQ_FUA support to bio path and that > deserves another patch. Adding it is a requirement for merging the code. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 3/3] virtio-blk: Add bio-based IO path for virtio-blk
On Mon, Jul 30, 2012 at 09:31:06AM +0200, Paolo Bonzini wrote: > You only need to add REQ_FLUSH support. The virtio-blk protocol does > not support REQ_FUA, because there's no easy way to do it in userspace. A bio-based driver needs to handle both REQ_FLUSH and REQ_FUA as it does not get the sequencing of REQ_FUA into REQ_FLUSH that request based drivers can request. To what the REQ_FUA request gets translated is a different story. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
On 07/29/2012 11:44 AM, Michael S. Tsirkin wrote: > On Sat, Jul 21, 2012 at 02:02:58PM +0200, Sasha Levin wrote: >> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote: - err = init_vqs(vi); > + if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ)) > + vi->has_cvq = true; > + >>> How about we disable multiqueue if there's no cvq? >>> Will make logic a bit simpler, won't it? >> >> multiqueues don't really depend on cvq. Does this added complexity really >> justifies adding an artificial limit? > > Well !cvq support is a legacy feature: the reason we support it > in driver is to avoid breaking on old hosts. Adding more code to that > path just doesn't make much sense since old hosts won't have mq. Is it really a legacy feature? The spec suggests that its an optional queue which is not necessary for the operation of the device. Which is why we never implemented it in lkvm - we weren't interested in any of the features it provided at that time and we could provide high performance with vhost support even without it. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PULL] virtio
The following changes since commit f7da9cdf45cbbad5029d4858dcbc0134e06084ed: Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2012-07-28 06:00:39 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus.git tags/virtio-for-linus for you to fetch changes up to 6a743897144500fb4c4566ced3a498d5180fbb5b: virtio-blk: return VIRTIO_BLK_F_FLUSH to header. (2012-07-30 13:30:52 +0930) Virtio patches, mainly hotplugging fixes. Amit Shah (5): virtio ids: fix comment for virtio-rng virtio: rng: allow tasks to be killed that are waiting for rng input virtio: rng: don't wait on host when module is going away virtio: rng: split out common code in probe / remove for s3/s4 ops virtio: rng: s3/s4 support Asias He (3): virtio-blk: Call del_gendisk() before disable guest kick virtio-blk: Reset device after blk_cleanup_queue() virtio-blk: Use block layer provided spinlock Paolo Bonzini (1): virtio-blk: allow toggling host cache between writeback and writethrough Rusty Russell (1): virtio-blk: return VIRTIO_BLK_F_FLUSH to header. drivers/block/virtio_blk.c | 115 --- drivers/char/hw_random/virtio-rng.c | 37 ++- include/linux/virtio_blk.h | 10 ++- include/linux/virtio_ids.h |2 +- 4 files changed, 137 insertions(+), 27 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
9p broken?
Having an annoying bug on i386 kvm I decided to debug it buy running an i386 guest on my x86_64 host, use 9p to access a guest image, and run it using nested kvm. However, 9p appears to be broken: first, the configure test fails (patch sent). Second, while mount works, ls on the mount point causes qemu to crash with (gdb) bt #0 error_set (errp=0x7fffe95fb128, fmt=0x558d4568 "{ 'class': 'VirtFSFeatureBlocksMigration', 'data': { 'path': %s, 'tag': %s } }") at /home/tlv/akivity/qemu/error.c:32 #1 0x5567cb06 in v9fs_attach (opaque=0x7fffe95e3020) at /home/tlv/akivity/qemu/hw/9pfs/virtio-9p.c:988 #2 0x5561d19f in coroutine_trampoline (i0=1449767888, i1=21845) at /home/tlv/akivity/qemu/coroutine-ucontext.c:138 #3 0x75a93ef0 in ?? () from /lib64/libc.so.6 #4 0x7fffce00 in ?? () #5 0x in ?? ( **errp already points to a VirtFSFeatureBlocksMigration error; v9fs_attach() has been called a second time (the first time, understandably, on mount; the second on ls). Command line: qemu-system-x86_64 -m 4G -smp 4 -drive file=/images/Fedora-i386.img,if=virtio,cache=none -cdrom /images/iso/bfo.iso -device virtio-9p-pci,fsdev=root,mount_tag=root -fsdev local,id=root,path=/,security_model=passthrough -enable-kvm -net nic,model=virtio,netdev=net -netdev user,id=net -monitor stdio -cpu host -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] powerpc/kvm: Allow qemu hypercalls to set all regs
On 27.07.2012, at 08:18, Benjamin Herrenschmidt wrote: > Right now, whenever we exit for an hcall, upon return, we fetch > some register values from the structure that carries the hcall > results and update the vcpu accordingly. > > However, if the hypercall chooses instead to update the registers > itself by calling set_regs, then we end up clobbering those values. > > This is for example the case of the rtas calls used to reboot the > machine where a new CPU state is established and must not be clobbered. > > The simple fix is to always clear the hcall_needed flag when a > set-regs happen. This fixes reboot problems. I do understand the race you're running into. However, I don't understand why. QEMU's hcall return code fetches its GPRs from env->gprs. That should be perfectly fine if the values are 0 before you return from the hcall function. Could you please try to manually set the relevant regs to 0 in QEMU and see if that also fixes reboot for you? That'd keep the interface a bit less complex. Alex > > Signed-off-by: Benjamin Herrenschmidt > --- > > Note: There are similar issues if the reboot is triggered by an MMIO > emulation, or an OSI call, Alex, you might want to handle those cases > as well. In any cases, I'd like this to still go into 3.6 but I can > send it myself to Linus if you want. > > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c > index 3f2a836..9ab13d6 100644 > --- a/arch/powerpc/kvm/book3s.c > +++ b/arch/powerpc/kvm/book3s.c > @@ -463,6 +463,14 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, > struct kvm_regs *regs) > for (i = 0; i < ARRAY_SIZE(regs->gpr); i++) > kvmppc_set_gpr(vcpu, i, regs->gpr[i]); > > + /* > + * If the hypercall decides to change register values > + * explicitly, we must ensure that we do not override > + * them upon return from that hypercall. Among others > + * this happens on system reset > + */ > + vcpu->arch.hcall_needed = 0; > + > return 0; > } > > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > index 87f4dc8..75e196d 100644 > --- a/arch/powerpc/kvm/powerpc.c > +++ b/arch/powerpc/kvm/powerpc.c > @@ -592,6 +592,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, > struct kvm_run *run) > } else if (vcpu->arch.hcall_needed) { > int i; > > + /* > + * Note: This might not be called if the hypervisor > + * call has done a set_regs(). In this case hcall_needed > + * is cleared. This is necessary for reset to work properly > + */ > kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret); > for (i = 0; i < 9; ++i) > kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]); > > > -- > To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM: PPC: booke: Added debug handler
Installed debug handler will be used for guest debug support and debug facility emulation features (patches for these features will follow this patch). Signed-off-by: Liu Yu [bharat.bhus...@freescale.com: Substantial changes] Signed-off-by: Bharat Bhushan --- arch/powerpc/include/asm/kvm_host.h |1 + arch/powerpc/kernel/asm-offsets.c |1 + arch/powerpc/kvm/booke_interrupts.S | 45 +++ 3 files changed, 47 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 7a45194..daa 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -492,6 +492,7 @@ struct kvm_vcpu_arch { u32 tlbcfg[4]; u32 mmucfg; u32 epr; + u32 crit_save; #endif gpa_t paddr_accessed; gva_t vaddr_accessed; diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 52c7ad7..6f7c1c9 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -562,6 +562,7 @@ int main(void) DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst)); DEFINE(VCPU_FAULT_DEAR, offsetof(struct kvm_vcpu, arch.fault_dear)); DEFINE(VCPU_FAULT_ESR, offsetof(struct kvm_vcpu, arch.fault_esr)); + DEFINE(VCPU_CRIT_SAVE, offsetof(struct kvm_vcpu, arch.crit_save)); #endif /* CONFIG_PPC_BOOK3S */ #endif /* CONFIG_KVM */ diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S index c7aebc5..e002c34 100644 --- a/arch/powerpc/kvm/booke_interrupts.S +++ b/arch/powerpc/kvm/booke_interrupts.S @@ -74,6 +74,51 @@ _GLOBAL(kvmppc_handler_\ivor_nr) bctr .endm +.macro KVM_DBG_HANDLER ivor_nr scratch srr0 +_GLOBAL(kvmppc_handler_\ivor_nr) + mtspr \scratch, r4 + mfspr r4, SPRN_SPRG_THREAD + lwz r4, THREAD_KVM_VCPU(r4) + stw r3, VCPU_CRIT_SAVE(r4) + mfcrr3 + mfspr r4, SPRN_CSRR1 + andi. r4, r4, MSR_PR + bne 1f + /* debug interrupt happened in enter/exit path */ + mfspr r4, SPRN_CSRR1 + rlwinm r4, r4, 0, ~MSR_DE + mtspr SPRN_CSRR1, r4 + lis r4, 0x + ori r4, r4, 0x + mtspr SPRN_DBSR, r4 + mfspr r4, SPRN_SPRG_THREAD + lwz r4, THREAD_KVM_VCPU(r4) + mtcrr3 + lwz r3, VCPU_CRIT_SAVE(r4) + mfspr r4, \scratch + rfci +1: /* debug interrupt happened in guest */ + mfspr r4, \scratch + mtcrr3 + mr r3, r4 + mfspr r4, SPRN_SPRG_THREAD + lwz r4, THREAD_KVM_VCPU(r4) + stw r3, VCPU_GPR(r4)(r4) + stw r5, VCPU_GPR(r5)(r4) + stw r6, VCPU_GPR(r6)(r4) + lwz r3, VCPU_CRIT_SAVE(r4) + mfspr r5, \srr0 + stw r3, VCPU_GPR(r3)(r4) + stw r5, VCPU_PC(r4) + mfctr r5 + lis r6, kvmppc_resume_host@h + stw r5, VCPU_CTR(r4) + li r5, \ivor_nr + ori r6, r6, kvmppc_resume_host@l + mtctr r6 + bctr +.endm + .macro KVM_HANDLER_ADDR ivor_nr .long kvmppc_handler_\ivor_nr .endm -- 1.7.0.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM: PPC: booke: Allow multiple exception types
Current kvmppc_booke_handlers uses the same macro (KVM_HANDLER) and all handlers are considered to be the same size. This will not be the case if we want to use different macros for different handlers. This patch improves the kvmppc_booke_handler so that it can support different macros for different handlers. Signed-off-by: Liu Yu [bharat.bhus...@freescale.com: Substantial changes] Signed-off-by: Bharat Bhushan --- arch/powerpc/include/asm/kvm_ppc.h |2 - arch/powerpc/kvm/booke.c|9 --- arch/powerpc/kvm/booke.h|1 + arch/powerpc/kvm/booke_interrupts.S | 37 -- arch/powerpc/kvm/e500.c | 13 +++ 5 files changed, 48 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 823d563..6d3e665 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -47,8 +47,6 @@ enum emulation_result { extern int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu); extern int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu); -extern char kvmppc_handlers_start[]; -extern unsigned long kvmppc_handler_len; extern void kvmppc_handler_highmem(void); extern void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu); diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 47a7925..6fbdcfc 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -1528,6 +1528,7 @@ int __init kvmppc_booke_init(void) { #ifndef CONFIG_KVM_BOOKE_HV unsigned long ivor[16]; + unsigned long *handler = kvmppc_booke_handler_addr; unsigned long max_ivor = 0; int i; @@ -1561,14 +1562,14 @@ int __init kvmppc_booke_init(void) for (i = 0; i < 16; i++) { if (ivor[i] > max_ivor) - max_ivor = ivor[i]; + max_ivor = i; memcpy((void *)kvmppc_booke_handlers + ivor[i], - kvmppc_handlers_start + i * kvmppc_handler_len, - kvmppc_handler_len); + (void *)handler[i], handler[i + 1] - handler[i]); } flush_icache_range(kvmppc_booke_handlers, - kvmppc_booke_handlers + max_ivor + kvmppc_handler_len); + kvmppc_booke_handlers + ivor[max_ivor] + + handler[max_ivor + 1] - handler[max_ivor]); #endif /* !BOOKE_HV */ return 0; } diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h index ba61974..de9e526 100644 --- a/arch/powerpc/kvm/booke.h +++ b/arch/powerpc/kvm/booke.h @@ -65,6 +65,7 @@ (1 << BOOKE_IRQPRIO_CRITICAL)) extern unsigned long kvmppc_booke_handlers; +extern unsigned long kvmppc_booke_handler_addr[]; void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr); void kvmppc_mmu_msr_notify(struct kvm_vcpu *vcpu, u32 old_msr); diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S index 09456c4..c7aebc5 100644 --- a/arch/powerpc/kvm/booke_interrupts.S +++ b/arch/powerpc/kvm/booke_interrupts.S @@ -74,6 +74,14 @@ _GLOBAL(kvmppc_handler_\ivor_nr) bctr .endm +.macro KVM_HANDLER_ADDR ivor_nr + .long kvmppc_handler_\ivor_nr +.endm + +.macro KVM_HANDLER_END + .long kvmppc_handlers_end +.endm + _GLOBAL(kvmppc_handlers_start) KVM_HANDLER BOOKE_INTERRUPT_CRITICAL SPRN_SPRG_RSCRATCH_CRIT SPRN_CSRR0 KVM_HANDLER BOOKE_INTERRUPT_MACHINE_CHECK SPRN_SPRG_RSCRATCH_MC SPRN_MCSRR0 @@ -94,9 +102,7 @@ KVM_HANDLER BOOKE_INTERRUPT_DEBUG SPRN_SPRG_RSCRATCH_CRIT SPRN_CSRR0 KVM_HANDLER BOOKE_INTERRUPT_SPE_UNAVAIL SPRN_SPRG_RSCRATCH0 SPRN_SRR0 KVM_HANDLER BOOKE_INTERRUPT_SPE_FP_DATA SPRN_SPRG_RSCRATCH0 SPRN_SRR0 KVM_HANDLER BOOKE_INTERRUPT_SPE_FP_ROUND SPRN_SPRG_RSCRATCH0 SPRN_SRR0 - -_GLOBAL(kvmppc_handler_len) - .long kvmppc_handler_1 - kvmppc_handler_0 +_GLOBAL(kvmppc_handlers_end) /* Registers: * SPRG_SCRATCH0: guest r4 @@ -464,6 +470,31 @@ lightweight_exit: lwz r4, VCPU_GPR(r4)(r4) rfi + .data + .align 4 + .globl kvmppc_booke_handler_addr +kvmppc_booke_handler_addr: +KVM_HANDLER_ADDR BOOKE_INTERRUPT_CRITICAL +KVM_HANDLER_ADDR BOOKE_INTERRUPT_MACHINE_CHECK +KVM_HANDLER_ADDR BOOKE_INTERRUPT_DATA_STORAGE +KVM_HANDLER_ADDR BOOKE_INTERRUPT_INST_STORAGE +KVM_HANDLER_ADDR BOOKE_INTERRUPT_EXTERNAL +KVM_HANDLER_ADDR BOOKE_INTERRUPT_ALIGNMENT +KVM_HANDLER_ADDR BOOKE_INTERRUPT_PROGRAM +KVM_HANDLER_ADDR BOOKE_INTERRUPT_FP_UNAVAIL +KVM_HANDLER_ADDR BOOKE_INTERRUPT_SYSCALL +KVM_HANDLER_ADDR BOOKE_INTERRUPT_AP_UNAVAIL +KVM_HANDLER_ADDR BOOKE_INTERRUPT_DECREMENTER +KVM_HANDLER_ADDR BOOKE_INTERRUPT_FIT +KVM_HANDLER_ADDR BOOKE_INTERRUPT_WATCHDOG +KVM_HANDLER_ADDR BOOKE_INTERRUPT_DTLB_MISS +KVM_HANDLER_ADDR BOOKE_INTERRUPT_ITLB_MISS +KVM_HANDLER_ADDR BOOKE_INTERRUPT_DEBUG +KVM_HANDLER_ADDR BOOKE_INTERRUPT_SPE_UNAVAIL +KVM_HA
[PATCH] KVM: PPC: Book3S HV: Handle memory slot deletion and modification
>From 4df5da62cb49b71c9a70072341dcb0557768c520 Mon Sep 17 00:00:00 2001 From: Paul Mackerras Date: Mon, 30 Jul 2012 16:40:54 +1000 Subject: [PATCH] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly At present the HV style of KVM doesn't handle deletion or modification of memory slots correctly. Deletion occurs when userspace does the KVM_SET_USER_MEMORY_REGION with arguments specifying a new length of zero for a slot that contains memory. Modification occurs when the arguments specify a new guest_phys_addr or flags. To allow the HV code to know which operation (creation, deletion or modification) is being requested, it needs to see the old and new contents of the memslot. kvm_arch_prepare_memory_region has this information, so we modify it to pass it down to kvmppc_core_prepare_memory_region. We then modify the HV version of that to check which operation is being performed. If it is a deletion, we call a new function kvmppc_unmap_memslot to remove any HPT (hashed page table) entries referencing the memory being removed. Similarly, if we are modifying the guest physical address we also remove any HPT entries. If the KVM_MEM_LOG_DIRTY_PAGES flag is now set and wasn't before, we call kvmppc_hv_get_dirty_log to clear any dirty bits so we can detect all modifications from now on. Signed-off-by: Paul Mackerras --- arch/powerpc/include/asm/kvm_ppc.h |4 +++ arch/powerpc/kvm/book3s_64_mmu_hv.c | 27 ++-- arch/powerpc/kvm/book3s_hv.c| 61 +++ arch/powerpc/kvm/book3s_hv_rm_mmu.c |2 +- arch/powerpc/kvm/book3s_pr.c|2 ++ arch/powerpc/kvm/booke.c|2 ++ arch/powerpc/kvm/powerpc.c |2 +- 7 files changed, 76 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 0124937..044c921 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -140,11 +140,15 @@ extern void kvm_release_hpt(struct kvmppc_linear_info *li); extern int kvmppc_core_init_vm(struct kvm *kvm); extern void kvmppc_core_destroy_vm(struct kvm *kvm); extern int kvmppc_core_prepare_memory_region(struct kvm *kvm, +struct kvm_memory_slot *memslot, +struct kvm_memory_slot *old, struct kvm_userspace_memory_region *mem); extern void kvmppc_core_commit_memory_region(struct kvm *kvm, struct kvm_userspace_memory_region *mem); extern int kvm_vm_ioctl_get_smmu_info(struct kvm *kvm, struct kvm_ppc_smmu_info *info); +extern void kvmppc_unmap_memslot(struct kvm *kvm, +struct kvm_memory_slot *memslot); extern int kvmppc_bookehv_init(void); extern void kvmppc_bookehv_exit(void); diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index d03eb6f..69fee3b 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -830,7 +830,8 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp, psize = hpte_page_size(hptep[0], ptel); if ((hptep[0] & HPTE_V_VALID) && hpte_rpn(ptel, psize) == gfn) { - hptep[0] |= HPTE_V_ABSENT; + if (kvm->arch.using_mmu_notifiers) + hptep[0] |= HPTE_V_ABSENT; kvmppc_invalidate_hpte(kvm, hptep, i); /* Harvest R and C */ rcbits = hptep[1] & (HPTE_R_R | HPTE_R_C); @@ -850,6 +851,28 @@ int kvm_unmap_hva(struct kvm *kvm, unsigned long hva) return 0; } +void kvmppc_unmap_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot) +{ + unsigned long *rmapp; + unsigned long gfn; + unsigned long n; + + rmapp = memslot->rmap; + gfn = memslot->base_gfn; + for (n = memslot->npages; n; --n) { + /* +* Testing the present bit without locking is OK because +* the memslot has been marked invalid already, and hence +* no new HPTEs referencing this page can be created, +* thus the present bit can't go from 0 to 1. +*/ + if (*rmapp & KVMPPC_RMAP_PRESENT) + kvm_unmap_rmapp(kvm, rmapp, gfn); + ++rmapp; + ++gfn; + } +} + static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp, unsigned long gfn) { @@ -1012,7 +1035,7 @@ long kvmppc_hv_get_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot) rmapp = memslot->rmap; map = memslot->dirty_bitmap; for (i = 0; i < memslot->npages; ++i) { - if (kvm_test_clear_dirty(kvm, rmapp)) + if (kvm_test_clear_dirty(kvm, rmapp) &
Re: KVM call agenda for Tuesday, July 31
Il 30/07/2012 09:34, Juan Quintela ha scritto: > > Hi > > Please send in any agenda items you are interested in covering. How/when to merge the "new error" series. My proposal: make soft feature freeze shorter, delay rc0 slightly, only take care of errors in the meanwhile. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
On 07/30/2012 10:12 AM, Paolo Bonzini wrote: > Il 30/07/2012 01:50, Rusty Russell ha scritto: >>> Also, being the first user of chained scatterlist doesn't exactly give >>> me warm fuzzies. >> >> We're far from the first user: they've been in the kernel for well over >> 7 years. They were introduced for the block layer, but they tended to >> ignore private uses of scatterlists like this one. > > Yeah, but sg_chain has no users in drivers, only a private one in > lib/scatterlist.c. The internal API could be changed to something else > and leave virtio-scsi screwed... > >> Yes, we should do this. But note that this means an iteration, so we >> might as well combine the loops :) > > I'm really bad at posting pseudo-code, but you can count the number of > physically-contiguous entries at the beginning of the list only. So if > everything is contiguous, you use a single non-indirect buffer and save > a kmalloc. If you use indirect buffers, I suspect it's much less > effective to collapse physically-contiguous entries. More elaborate > heuristics do need a loop, though. > [All the below with a grain of salt, from my senile memory] You must not forget some facts about the scatterlist received here at the LLD. It has already been DMA mapped and locked by the generic layer. Which means that the DMA engine has already collapsed physically-contiguous entries. Those you get here are already unique physically. (There were bugs in the past, where this was not true, please complain if you find them again) A scatterlist is two different lists taking the same space, but with two different length. - One list is the PAGE pointers plus offset && length, which is bigger or equal to the 2nd list. The end marker corresponds to this list. This list is the input into the DMA engine. - Second list is the physical DMA addresses list. With their physical-lengths. Offset is not needed because it is incorporated in the DMA address. This list is the output from the DMA engine. The reason 2nd list is shorter is because the DMA engine tries to minimize the physical scatter-list entries which is usually a limited HW resource. This list might follow chains but it's end is determined by the received sg_count from the DMA engine, not by the end marker. At the time my opinion, and I think Rusty agreed, was that the scatterlist should be split in two. The input page-ptr list is just the BIO, and the output of the DMA-engine should just be the physical part of the sg_list, as a separate parameter. But all this was berried under too much APIs and the noise was two strong, for any single brave sole. So I'd just trust blindly the returned sg_count from the DMA engine, it is already optimized. I THINK > Paolo Boaz -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH stub] kvm: caching API for interrupts
On 07/29/2012 11:00 PM, Michael S. Tsirkin wrote: > I've been looking at adding caching for IRQs so that we don't need to > scan all VCPUs on each interrupt. One issue I had a problem with, is > how the cache structure can be used from both a thread (to fill out the > cache) and interrupt (to actually send if cache is valid). > > For now just added a lock field in the cache so we don't need to worry > about this, and with such a lock in place we don't have to worry about > RCU as cache can be invalidated simply under this lock. > > For now this just declares the structure and updates the APIs > so it's not intended to be applied, but just to give you > the idea. > > Comments, flames wellcome. I hope you aren't expecting any of the latter. > > +struct kvm_irq_cache { > + spinlock_t lock; > + bool valid; > + struct kvm_vcpu *dest; > + /* For now we only cache lapic irqs */ > + struct kvm_lapic_irq irq; > + /* Protected by kvm->irq_cache_lock */ > + struct list_head list; > +}; > + Why an external structure? Why not add something to kvm_kernel_irq_routing_entry? It is already protected by rcu. The atomic context code could look like this: if (kire->cache.valid) { kvm_apic_set_irq(kire->cache.vcpu, &kire->cache.irq); return; } queue_work(process_irq_slowpath); The slow path tries to fills the cache, and processes the interrupt in any case. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: How does QEMU communicate with KVM?
Sorry for sending garbage mail There are several ways that qemu and kvm can communicate. One of them is the ioctl to /dev/kvm and it is documented in Documentation/virtual/kvm/api.txt. But there's other ways also: like the pages that are mmap-ed from kvm_vcpu_fault function, I’m not sure about the documentation of such channels. regards Min-gyu Kim --- Original Message --- Sender : Richard Yao Date : 2012-07-30 14:47 (GMT+09:00) Title : How does QEMU communicate with KVM? Is there any documentation on how QEMU communicates with KVM? From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 김민규 Sent: Monday, July 30, 2012 5:23 PM To: Richard Yao; kvm@vger.kernel.org Subject: Re: How does QEMU communicate with KVM? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How does QEMU communicate with KVM?
There are several ways that qemu and kvm can communicate. One of them is the ioctl to /dev/kvm and it is documented in Documentation/virtual/kvm/api.txt. But there's other ways also: like the pages that are mmap-ed from kvm_vcpu_fault function, As far as I know, there is no documentation on such communication channel(but i'm not sure) regards Min-gyu Kim --- Original Message --- Sender : Richard Yao Date : 2012-07-30 14:47 (GMT+09:00) Title : How does QEMU communicate with KVM? Is there any documentation on how QEMU communicates with KVM?
RE: [PATCH 2/2] KVM: PPC: booke/bookehv: Add guest debug support
> -Original Message- > From: Wood Scott-B07421 > Sent: Friday, July 27, 2012 7:00 AM > To: Bhushan Bharat-R65777 > Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Bhushan > Bharat- > R65777 > Subject: Re: [PATCH 2/2] KVM: PPC: booke/bookehv: Add guest debug support > > On 07/26/2012 12:32 AM, Bharat Bhushan wrote: > > This patch adds: > > 1) KVM debug handler added for e500v2. > > 2) Guest debug by qemu gdb stub. > > Does it make sense for these to both be in the same patch? If there's common > code used by both, that could be added first. ok > > > Signed-off-by: Liu Yu > > Signed-off-by: Varun Sethi > > [bharat.bhus...@freescale.com: Substantial changes] > > Signed-off-by: Bharat Bhushan > > --- > > arch/powerpc/include/asm/kvm.h| 21 + > > arch/powerpc/include/asm/kvm_host.h |7 ++ > > arch/powerpc/include/asm/kvm_ppc.h|2 + > > arch/powerpc/include/asm/reg_booke.h |1 + > > arch/powerpc/kernel/asm-offsets.c | 31 ++- > > arch/powerpc/kvm/booke.c | 146 +++--- > > arch/powerpc/kvm/booke_interrupts.S | 160 > - > > arch/powerpc/kvm/bookehv_interrupts.S | 141 - > > arch/powerpc/kvm/e500mc.c |3 +- > > arch/powerpc/kvm/powerpc.c|2 +- > > 10 files changed, 492 insertions(+), 22 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/kvm.h > > b/arch/powerpc/include/asm/kvm.h index 3c14202..da71c84 100644 > > --- a/arch/powerpc/include/asm/kvm.h > > +++ b/arch/powerpc/include/asm/kvm.h > > @@ -25,6 +25,7 @@ > > /* Select powerpc specific features in */ #define > > __KVM_HAVE_SPAPR_TCE #define __KVM_HAVE_PPC_SMT > > +#define __KVM_HAVE_GUEST_DEBUG > > > > struct kvm_regs { > > __u64 pc; > > @@ -265,10 +266,19 @@ struct kvm_fpu { }; > > > > struct kvm_debug_exit_arch { > > + __u32 exception; > > + __u32 pc; > > + __u32 status; > > }; > > PC must be 64-bit. What goes in "status" and "exception"? ok > > > /* for KVM_SET_GUEST_DEBUG */ > > struct kvm_guest_debug_arch { > > + struct { > > + __u64 addr; > > + __u32 type; > > + __u32 pad1; > > + __u64 pad2; > > + } bp[16]; > > }; > > What goes in "type"? Type denote breakpoint, read watchpoint, write watchpoint or watchpoint (both read and write). Will adding a comment to describe this is ok? > > > /* definition of registers in kvm_run */ @@ -285,6 +295,17 @@ struct > > kvm_sync_regs { > > #define KVM_CPU_3S_64 4 > > #define KVM_CPU_E500MC 5 > > > > +/* Debug related defines */ > > +#define KVM_INST_GUESTGDB 0x7C00021C /* ehpriv OC=0 */ > > Will this work on all PPC? > > It certainly won't work on other architectures, so at a minimum it's > KVM_PPC_INST_GUEST_GDB, but maybe it needs to be determined at runtime. How to determine at run time? adding another ioctl ? > > > +#define KVM_GUESTDBG_USE_SW_BP 0x0001 > > +#define KVM_GUESTDBG_USE_HW_BP 0x0002 > > Where do these get used? Any reason for these particular values? If you're > trying to create a partition where the upper half is generic and the lower > half > is arch-specific, say so. KVM_SET_GUEST_DEBUG ioctl used to set/unset debug interrupts, which have a "u32 control" element. We have inherited this mechanism from x86 implementation and it looks like lower 16 bits are generic (like KVM_GUESTDBG_ENBLE, KVM_GUESTDBG_SINGLESTEP etc and upper 16 bits are Architecture specific. I will add a comment to describe this. > > > diff --git a/arch/powerpc/include/asm/kvm_host.h > > b/arch/powerpc/include/asm/kvm_host.h > > index 7a45194..524af7a 100644 > > --- a/arch/powerpc/include/asm/kvm_host.h > > +++ b/arch/powerpc/include/asm/kvm_host.h > > @@ -458,7 +458,12 @@ struct kvm_vcpu_arch { > > u32 ccr0; > > u32 ccr1; > > u32 dbsr; > > + /* guest debug regiters*/ > > struct kvmppc_booke_debug_reg dbg_reg; > > + /* shadow debug registers */ > > + struct kvmppc_booke_debug_reg shadow_dbg_reg; > > + /* host debug regiters*/ > > + struct kvmppc_booke_debug_reg host_dbg_reg; > > s/regiter/register/g > > ...and put a space before */ > > > @@ -492,6 +497,7 @@ struct kvm_vcpu_arch { > > u32 tlbcfg[4]; > > u32 mmucfg; > > u32 epr; > > + u32 crit_save; > > #endif > > What is crit_save? > > > gpa_t paddr_accessed; > > gva_t vaddr_accessed; > > @@ -533,6 +539,7 @@ struct kvm_vcpu_arch { > > struct kvm_vcpu_arch_shared *shared; > > unsigned long magic_page_pa; /* phys addr to map the magic page to */ > > unsigned long magic_page_ea; /* effect. addr to map the magic page > > to */ > > + struct kvm_guest_debug_arch dbg; /* debug arg between kvm and qemu > > +*/ > > Is kvm_guest_debug_arch generic or PPC-specific? If the former, why is it in > a > ppc struct? If the latter, why doesn't it h
KVM call agenda for Tuesday, July 31
Hi Please send in any agenda items you are interested in covering. Thanks, Juan. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 3/3] virtio-blk: Add bio-based IO path for virtio-blk
Il 30/07/2012 06:43, Asias He ha scritto: >> > > Yes. Something like this: > > qemu -drive file=foo.img,cache=writeback/unsafe > > is not safe against power losses also? cache=writeback and cache=none are safe, cache=unsafe isn't. > I think we can add REQ_FLUSH & REQ_FUA support to bio path and that > deserves another patch. You only need to add REQ_FLUSH support. The virtio-blk protocol does not support REQ_FUA, because there's no easy way to do it in userspace. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
Il 30/07/2012 01:50, Rusty Russell ha scritto: >> Also, being the first user of chained scatterlist doesn't exactly give >> me warm fuzzies. > > We're far from the first user: they've been in the kernel for well over > 7 years. They were introduced for the block layer, but they tended to > ignore private uses of scatterlists like this one. Yeah, but sg_chain has no users in drivers, only a private one in lib/scatterlist.c. The internal API could be changed to something else and leave virtio-scsi screwed... > Yes, we should do this. But note that this means an iteration, so we > might as well combine the loops :) I'm really bad at posting pseudo-code, but you can count the number of physically-contiguous entries at the beginning of the list only. So if everything is contiguous, you use a single non-indirect buffer and save a kmalloc. If you use indirect buffers, I suspect it's much less effective to collapse physically-contiguous entries. More elaborate heuristics do need a loop, though. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH stub] kvm: caching API for interrupts
On Sun, Jul 29, 2012 at 11:00:58PM +0300, Michael S. Tsirkin wrote: > I've been looking at adding caching for IRQs so that we don't need to > scan all VCPUs on each interrupt. One issue I had a problem with, is > how the cache structure can be used from both a thread (to fill out the > cache) and interrupt (to actually send if cache is valid). I do not get this. Cache should be fill out at the time of use. > > For now just added a lock field in the cache so we don't need to worry > about this, and with such a lock in place we don't have to worry about > RCU as cache can be invalidated simply under this lock. Cache should be invalidated at the time of use if irq routing changed. > > For now this just declares the structure and updates the APIs > so it's not intended to be applied, but just to give you > the idea. > > Comments, flames wellcome. > There is no enough meat to discuss here I am afraid. > Signed-off-by: Michael S. Tsirkin > > -- > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 96c158a..3384b39 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -214,11 +214,28 @@ static inline unsigned long > kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memsl > return ALIGN(memslot->npages, BITS_PER_LONG) / 8; > } > > +struct kvm_irq_cache { > + spinlock_t lock; > + bool valid; > + struct kvm_vcpu *dest; > + /* For now we only cache lapic irqs */ > + struct kvm_lapic_irq irq; > + /* Protected by kvm->irq_cache_lock */ > + struct list_head list; Why are you storing them in the list? This should be embedded structure. > +}; > + > +int kvm_set_irq_and_cache(struct kvm *kvm, int irq_source_id, u32 irq, int > level, > + struct kvm_irq_cache *cache); > +int kvm_set_irq_cached(struct kvm *kvm, struct kvm_irq_cache *cache); > +void kvm_irq_cache_init(struct kvm *kvm, struct kvm_irq_cache *cache); > +void kvm_irq_cache_cleanup(struct kvm_irq_cache *cache); > + > struct kvm_kernel_irq_routing_entry { > u32 gsi; > u32 type; > int (*set)(struct kvm_kernel_irq_routing_entry *e, > -struct kvm *kvm, int irq_source_id, int level); > +struct kvm *kvm, int irq_source_id, int level, > +struct kvm_irq_cache *cache); > union { > struct { > unsigned irqchip; > @@ -304,6 +321,8 @@ struct kvm { > struct kvm_irq_routing_table __rcu *irq_routing; > struct hlist_head mask_notifier_list; > struct hlist_head irq_ack_notifier_list; > + struct mutex irq_cache_lock; > + struct list_head irq_cache_list; > #endif > > #ifdef KVM_ARCH_WANT_MMU_NOTIFIER > @@ -617,7 +636,8 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic > *ioapic, > #endif > int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level); > int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm > *kvm, > - int irq_source_id, int level); > + int irq_source_id, int level, > + struct kvm_irq_cache *cache); > void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin); > void kvm_register_irq_ack_notifier(struct kvm *kvm, > struct kvm_irq_ack_notifier *kian); > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c > index 7d7e2aa..9622076 100644 > --- a/virt/kvm/eventfd.c > +++ b/virt/kvm/eventfd.c > @@ -138,7 +138,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, > void *key) > irq = rcu_dereference(irqfd->irq_entry); > /* An event has been signaled, inject an interrupt */ > if (irq) > - kvm_set_msi(irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1); > + kvm_set_msi(irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1, > NULL); > else > schedule_work(&irqfd->inject); > rcu_read_unlock(); > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c > index 5afb431..d92f3d3 100644 > --- a/virt/kvm/irq_comm.c > +++ b/virt/kvm/irq_comm.c > @@ -46,7 +46,8 @@ static inline int kvm_irq_line_state(unsigned long > *irq_state, > } > > static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e, > -struct kvm *kvm, int irq_source_id, int level) > +struct kvm *kvm, int irq_source_id, int level, > +struct kvm_irq_cache *cache) > { > #ifdef CONFIG_X86 > struct kvm_pic *pic = pic_irqchip(kvm); > @@ -59,7 +60,8 @@ static int kvm_set_pic_irq(struct > kvm_kernel_irq_routing_entry *e, > } > > static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e, > - struct kvm *kvm, int irq_source_id, int level) > + struct kvm *kvm, int irq_source_id, int level, > + struct kvm_irq_cache *cache) > { > struct kvm_ioapic *ioa