RE: [Qemu-devel] [PATCH v4] Fixes related to processing of qemu's -numa option
Thanks Eduardo! Hi Anthony, if you are OK with this patch, could you please pull these changes into upstream, or suggest whom I should talk to to get these changes in? Thanks! Vinod

-----Original Message-----
From: Eduardo Habkost [mailto:ehabk...@redhat.com]
Sent: Wednesday, July 18, 2012 10:15 AM
To: Vinod, Chegu
Cc: qemu-de...@nongnu.org; aligu...@us.ibm.com; kvm@vger.kernel.org
Subject: Re: [Qemu-devel] [PATCH v4] Fixes related to processing of qemu's -numa option

On Mon, Jul 16, 2012 at 09:31:30PM -0700, Chegu Vinod wrote:

Changes since v3:
- Use bitmap_set() instead of set_bit() in the numa_add() routine.
- Removed the call to bitmap_zero() since bitmap_new() already zeros the bitmap.
- Rebased to the latest qemu.

Tested-by: Eduardo Habkost ehabk...@redhat.com
Reviewed-by: Eduardo Habkost ehabk...@redhat.com

Changes since v2:
- Use unsigned long * for node_cpumask[].
- Use bitmap_new() instead of g_malloc0() for allocation.
- Don't rely on max_cpus, since it may not be initialized before the numa-related qemu options are parsed and processed.

Note: Continuing to use a new constant for allocation of the mask. (This constant is currently set to 255, since with an 8-bit APIC ID, VCPUs can range from 0-254 in a guest. APIC ID 255 (0xFF) is reserved for broadcast.)

Changes since v1:
- Use bitmap functions that are already in qemu (instead of the cpu_set_t macros from sched.h).
- Added a check for endvalue >= max_cpus.
- Fix to address the round-robin assignment when CPUs are not explicitly specified.

---

v1:

The -numa option to qemu is used to create [fake] numa nodes and expose them to the guest OS instance. There are a couple of issues with the -numa option:

a) The maximum number of VCPUs that can be specified for a guest while using qemu's -numa option is 64. Due to a typecasting issue, when the number of VCPUs is greater than 32 the VCPUs don't show up under the specified [fake] numa nodes.

b) KVM currently has support for 160 VCPUs per guest.
The qemu -numa option only supports up to 64 VCPUs per guest. This patch addresses these two issues. Below are examples of (a) and (b).

a) 32 VCPUs are specified with the -numa option:

/usr/local/bin/qemu-system-x86_64 \
 -enable-kvm \
 71:01:01 \
 -net tap,ifname=tap0,script=no,downscript=no \
 -vnc :4 ...

Upstream qemu:

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 32 33 34 35 36 37 38 39 40 41
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59
node 2 size: 131072 MB
node 3 cpus: 30
node 3 size: 131072 MB
node 4 cpus:
node 4 size: 131072 MB
node 5 cpus: 31
node 5 size: 131072 MB

With the patch applied:

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 131072 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 131072 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 131072 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size: 131072 MB

b) 64 VCPUs are specified with the -numa option:

/usr/local/bin/qemu-system-x86_64 \
 -enable-kvm \
 -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+d-vnc :4 ...

Upstream qemu:

only 63 CPUs in NUMA mode supported.
only 64 CPUs in NUMA mode supported.
QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 6 7 8 9 38 39 40 41 70 71 72 73
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 74 75 76 77 78 79
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 60 61
node 2 size: 65536 MB
node 3 cpus: 30 62
node 3 size: 65536 MB
node 4 cpus:
node 4 size: 65536 MB
node 5 cpus:
node 5 size: 65536 MB
node 6 cpus: 31 63
node 6 size: 65536 MB
node 7 cpus: 0 1 2 3 4 5 32 33 34 35 36 37 64 65 66 67 68 69
node 7 size: 65536 MB

With the patch applied:

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 65536 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 65536 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 65536 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size:
Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:

Back to the original point though: the current situation is that calling kvm_set_irq() under a spinlock is not worse for scalability than calling it not under one.

Yes. Still, the specific use can just use an atomic flag; lock+bool is not needed, and we won't need to undo it later.

Actually, no, replacing it with an atomic is racy.

CPU0 (inject)                     CPU1 (EOI)
atomic_cmpxchg(asserted, 0, 1)
                                  atomic_cmpxchg(asserted, 1, 0)
                                  kvm_set_irq(0)
kvm_set_irq(1)
eventfd_signal

The interrupt is now stuck on until another interrupt is injected.

Well, the EOI somehow happened here before the interrupt, so it's a bug somewhere else?

Interrupts can be shared. We also can't guarantee that the guest won't write a bogus EOI to the ioapic. The irq ack notifier doesn't filter on irq source id... I'm not sure it can. I guess if Avi OKs adding another kvm_set_irq under spinlock, that's the best we can do for now.

Why can't a mutex be used instead of a spinlock again?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
On Wed, Jul 18, 2012 at 03:42:09PM -0300, Marcelo Tosatti wrote:

On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:

Back to the original point though: the current situation is that calling kvm_set_irq() under a spinlock is not worse for scalability than calling it not under one.

Yes. Still, the specific use can just use an atomic flag; lock+bool is not needed, and we won't need to undo it later.

Actually, no, replacing it with an atomic is racy.

CPU0 (inject)                     CPU1 (EOI)
atomic_cmpxchg(asserted, 0, 1)
                                  atomic_cmpxchg(asserted, 1, 0)
                                  kvm_set_irq(0)
kvm_set_irq(1)
eventfd_signal

The interrupt is now stuck on until another interrupt is injected.

Well, the EOI somehow happened here before the interrupt, so it's a bug somewhere else?

Interrupts can be shared. We also can't guarantee that the guest won't write a bogus EOI to the ioapic. The irq ack notifier doesn't filter on irq source id... I'm not sure it can. I guess if Avi OKs adding another kvm_set_irq under spinlock, that's the best we can do for now.

Why can't a mutex be used instead of a spinlock again?

Why was it changed in the first place? The commit says that the function is called from unsleepable context, but there is no stack trace.

--
Gleb.
Re: [RFC v9 11/27] virtio-blk: Indirect vring and flush support
On Wed, Jul 18, 2012 at 04:07:38PM +0100, Stefan Hajnoczi wrote:

RHEL6 and other new guest kernels use indirect vring descriptors to increase the number of requests that can be batched. This fundamentally changes vring from a scheme that requires fixed resources to something more dynamic (although there is still an absolute maximum number of descriptors). Cope with indirect vrings by taking on as many requests as we can in one go and then postponing the remaining requests until the first batch completes.

It would be possible to switch to dynamic resource management so iovec and iocb structs are malloced. This would allow the entire ring to be processed even with indirect descriptors, but would probably hit a bottleneck when io_submit refuses to queue more requests. Therefore, stick with the simpler scheme for now.

Unfortunately Linux AIO does not support asynchronous fsync/fdatasync on all files. In particular, an O_DIRECT-opened file on ext4 does not support Linux AIO fdsync. Work around this by performing fdatasync() synchronously for now.
Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/dataplane/ioq.h   |  18 +-
 hw/dataplane/vring.h | 103 +++---
 hw/virtio-blk.c      |  75 ++--
 3 files changed, 144 insertions(+), 52 deletions(-)

diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
index 7200e87..d1545d6 100644
--- a/hw/dataplane/ioq.h
+++ b/hw/dataplane/ioq.h
@@ -3,7 +3,7 @@
 typedef struct {
     int fd;                     /* file descriptor */
-    unsigned int max_reqs;   /* max length of freelist and queue */
+    unsigned int max_reqs;      /* max length of freelist and queue */

     io_context_t io_ctx;        /* Linux AIO context */
     EventNotifier io_notifier;  /* Linux AIO eventfd */

@@ -91,18 +91,16 @@ static struct iocb *ioq_rdwr(IOQueue *ioq, bool read, struct iovec *iov, unsigne
     return iocb;
 }

-static struct iocb *ioq_fdsync(IOQueue *ioq)
-{
-    struct iocb *iocb = ioq_get_iocb(ioq);
-
-    io_prep_fdsync(iocb, ioq->fd);
-    io_set_eventfd(iocb, event_notifier_get_fd(ioq->io_notifier));
-    return iocb;
-}
-
 static int ioq_submit(IOQueue *ioq)
 {
     int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
+    if (unlikely(rc < 0)) {
+        unsigned int i;
+        fprintf(stderr, "io_submit io_ctx=%#lx nr=%d iovecs=%p\n",
+                (uint64_t)ioq->io_ctx, ioq->queue_idx, ioq->queue);
+        for (i = 0; i < ioq->queue_idx; i++) {
+            fprintf(stderr, "[%u] type=%#x fd=%d\n",
+                    i, ioq->queue[i]->aio_lio_opcode, ioq->queue[i]->aio_fildes);
+        }
+    }
     ioq->queue_idx = 0; /* reset */
     return rc;
 }

diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index 70675e5..3eab4b4 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -64,6 +64,86 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, int n)
             vring->vr.desc, vring->vr.avail, vring->vr.used);
 }

+static bool vring_more_avail(Vring *vring)
+{
+    return vring->vr.avail->idx != vring->last_avail_idx;
+}
+
+/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */

So add a Red Hat copyright pls.

+static bool get_indirect(Vring *vring,
+                         struct iovec iov[], struct iovec *iov_end,
+                         unsigned int *out_num, unsigned int *in_num,
+                         struct vring_desc *indirect)
+{
+    struct vring_desc desc;
+    unsigned int i = 0, count, found = 0;
+
+    /* Sanity check */
+    if (unlikely(indirect->len % sizeof desc)) {
+        fprintf(stderr, "Invalid length in indirect descriptor: "
+                "len 0x%llx not multiple of 0x%zx\n",
+                (unsigned long long)indirect->len,
+                sizeof desc);
+        exit(1);
+    }
+
+    count = indirect->len / sizeof desc;
+    /* Buffers are chained via a 16 bit next field, so
+     * we can have at most 2^16 of these. */
+    if (unlikely(count > USHRT_MAX + 1)) {
+        fprintf(stderr, "Indirect buffer length too big: %d\n",
+                indirect->len);
+        exit(1);
+    }
+
+    /* Point to translate indirect desc chain */
+    indirect = phys_to_host(vring, indirect->addr);
+
+    /* We will use the result as an address to read from, so most
+     * architectures only need a compiler barrier here. */
+    __sync_synchronize(); /* read_barrier_depends(); */
+
+    do {
+        if (unlikely(++found > count)) {
+            fprintf(stderr, "Loop detected: last one at %u "
+                    "indirect size %u\n",
+                    i, count);
+            exit(1);
+        }
+
+        desc = *indirect++;
+        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
+            fprintf(stderr, "Nested indirect
Re: [RFC v9 12/27] virtio-blk: Add workaround for BUG_ON() dependency in virtio_ring.h
On Wed, Jul 18, 2012 at 04:07:39PM +0100, Stefan Hajnoczi wrote:

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 hw/dataplane/vring.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
index 3eab4b4..44ef4a9 100644
--- a/hw/dataplane/vring.h
+++ b/hw/dataplane/vring.h
@@ -1,6 +1,11 @@
 #ifndef VRING_H
 #define VRING_H

+/* Some virtio_ring.h files use BUG_ON() */

It's a bug then. Do we really need to work around broken systems? If yes, let's just ship our own headers...

+#ifndef BUG_ON
+#define BUG_ON(x)
+#endif
+
 #include <linux/virtio_ring.h>
 #include "qemu-common.h"

--
1.7.10.4
Re: [RFC v9 22/27] virtio-blk: Fix request merging
On Wed, Jul 18, 2012 at 04:07:49PM +0100, Stefan Hajnoczi wrote:

Khoa Huynh k...@us.ibm.com discovered that request merging is broken. The merged iocb is not updated to reflect the total number of iovecs, and the offset is also outdated. This patch fixes request merging.

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

So all these fixups need to be folded in, making it correct the first time.

---
 hw/virtio-blk.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 9131a7a..51807b5 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -178,13 +178,17 @@ static void merge_request(struct iocb *iocb_a, struct iocb *iocb_b)
         req_a->len = iocb_nbytes(iocb_a);
     }

-    iocb_b->u.v.vec = iovec;
-    req_b->len = iocb_nbytes(iocb_b);
-    req_b->next_merged = req_a;
     /* fprintf(stderr, "merged %p (%u) and %p (%u), %u iovecs in total\n",
        req_a, iocb_a->u.v.nr, req_b, iocb_b->u.v.nr, iocb_a->u.v.nr + iocb_b->u.v.nr); */
+
+    iocb_b->u.v.vec = iovec;
+    iocb_b->u.v.nr += iocb_a->u.v.nr;
+    iocb_b->u.v.offset = iocb_a->u.v.offset;
+
+    req_b->len = iocb_nbytes(iocb_b);
+    req_b->next_merged = req_a;
 }

 static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_num, unsigned int in_num, unsigned int head)

--
1.7.10.4
Re: [RFC v9 23/27] virtio-blk: Stub out SCSI commands
On Wed, Jul 18, 2012 at 04:07:50PM +0100, Stefan Hajnoczi wrote:

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

Why?

---
 hw/virtio-blk.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 51807b5..8734029 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -215,14 +215,8 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
     /* TODO Linux sets the barrier bit even when not advertised! */
     uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
-
-    if (unlikely(type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
-        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
-        exit(1);
-    }
-
     struct iocb *iocb;
-    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
+    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_SCSI_CMD | VIRTIO_BLK_T_FLUSH)) {
     case VIRTIO_BLK_T_IN:
         if (unlikely(out_num != 1)) {
             fprintf(stderr, "virtio-blk invalid read request\n");
@@ -239,6 +233,21 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
         iocb = ioq_rdwr(ioq, false, &iov[1], out_num - 1, outhdr->sector * 512UL); /* TODO is it always 512? */
         break;
+    case VIRTIO_BLK_T_SCSI_CMD:
+        if (unlikely(in_num == 0)) {
+            fprintf(stderr, "virtio-blk invalid SCSI command request\n");
+            exit(1);
+        }
+
+        /* TODO support SCSI commands */
+        {
+            VirtIOBlock *s = container_of(ioq, VirtIOBlock, ioqueue);
+            inhdr->status = VIRTIO_BLK_S_UNSUPP;
+            vring_push(&s->vring, head, sizeof *inhdr);
+            virtio_blk_notify_guest(s);
+        }
+        return;
+
     case VIRTIO_BLK_T_FLUSH:
         if (unlikely(in_num != 1 || out_num != 1)) {
             fprintf(stderr, "virtio-blk invalid flush request\n");
@@ -256,7 +265,7 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
         return;
     default:
-        fprintf(stderr, "virtio-blk multiple request type bits set\n");
+        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
         exit(1);
     }

--
1.7.10.4
Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:

On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:

Back to the original point though: the current situation is that calling kvm_set_irq() under a spinlock is not worse for scalability than calling it not under one.

Yes. Still, the specific use can just use an atomic flag; lock+bool is not needed, and we won't need to undo it later.

Actually, no, replacing it with an atomic is racy.

CPU0 (inject)                     CPU1 (EOI)
atomic_cmpxchg(asserted, 0, 1)
                                  atomic_cmpxchg(asserted, 1, 0)
                                  kvm_set_irq(0)
kvm_set_irq(1)
eventfd_signal

The interrupt is now stuck on until another interrupt is injected.

Well, the EOI somehow happened here before the interrupt, so it's a bug somewhere else?

Interrupts can be shared. We also can't guarantee that the guest won't write a bogus EOI to the ioapic. The irq ack notifier doesn't filter on irq source id... I'm not sure it can. I guess if Avi OKs adding another kvm_set_irq under spinlock, that's the best we can do for now.

Why can't a mutex be used instead of a spinlock again?

eventfd_signal calls the inject function from atomic context.
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On 07/18/2012 11:47 AM, James Bottomley wrote:

On Wed, 2012-07-18 at 11:00 -0500, Anthony Liguori wrote:

Of course: think about the consequences. You want to upgrade one array on your SAN. You definitely don't want to shut down your entire data centre to achieve it. In-place upgrades on running SANs have been common in enterprise environments for a while.

Would firmware upgrades ever result in major OS-visible changes though? Maybe OSes are more robust with SCSI than with other types of buses, but I don't think it's safe to completely ignore the problem. I agree that in general, SCSI targets don't need this, but I'm pretty sure that if a guest probes for a command, you migrate to an old version, and that command is no longer there, badness will ensue.

What command are we talking about? Operation of initiators is usually just READ and WRITE. So perhaps we might have inline UNMAP ... but the world wouldn't come to an end even if the latter stopped working.

Is that true for all OSes? Linux may handle things gracefully if UNMAP starts throwing errors, but that doesn't mean that Windows will. There are other cases where this creates problems. Windows (and some other OSes) fingerprint the hardware profile in order to do license enforcement. If the hardware changes beyond a certain amount, then they refuse to boot. Windows does this with a points system, and I do believe that INQUIRY responses from any local disks are included in this tally.

Most of the complex SCSI stuff is done at start of day; it's actually only then that we'd notice things like changes in INQUIRY strings or mode pages. Failover, which is what you're talking about, requires reinstatement of all the operating parameters of the source/target system, but that's not wholly the responsibility of the storage system ...

It's the responsibility of the hypervisor when dealing with live migration.
Regards,

Anthony Liguori

James

It's different when you're talking about a reboot happening or a disconnect/reconnect due to firmware upgrade. The OS would naturally be reprobing in this case.
Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
On Wed, 2012-07-18 at 13:07 -0600, Alex Williamson wrote:

On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:

On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:

Back to the original point though: the current situation is that calling kvm_set_irq() under a spinlock is not worse for scalability than calling it not under one.

Yes. Still, the specific use can just use an atomic flag; lock+bool is not needed, and we won't need to undo it later.

Actually, no, replacing it with an atomic is racy.

CPU0 (inject)                     CPU1 (EOI)
atomic_cmpxchg(asserted, 0, 1)
                                  atomic_cmpxchg(asserted, 1, 0)
                                  kvm_set_irq(0)
kvm_set_irq(1)
eventfd_signal

The interrupt is now stuck on until another interrupt is injected.

Well, the EOI somehow happened here before the interrupt, so it's a bug somewhere else?

Interrupts can be shared. We also can't guarantee that the guest won't write a bogus EOI to the ioapic. The irq ack notifier doesn't filter on irq source id... I'm not sure it can. I guess if Avi OKs adding another kvm_set_irq under spinlock, that's the best we can do for now.

Why can't a mutex be used instead of a spinlock again?

eventfd_signal calls the inject function from atomic context.

Actually, that's called from a workq. I'll have to switch it back and turn on lockdep to remember why I couldn't sleep there.
Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
On Wed, Jul 18, 2012 at 01:13:06PM -0600, Alex Williamson wrote:

On Wed, 2012-07-18 at 13:07 -0600, Alex Williamson wrote:

On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:

On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:

Back to the original point though: the current situation is that calling kvm_set_irq() under a spinlock is not worse for scalability than calling it not under one.

Yes. Still, the specific use can just use an atomic flag; lock+bool is not needed, and we won't need to undo it later.

Actually, no, replacing it with an atomic is racy.

CPU0 (inject)                     CPU1 (EOI)
atomic_cmpxchg(asserted, 0, 1)
                                  atomic_cmpxchg(asserted, 1, 0)
                                  kvm_set_irq(0)
kvm_set_irq(1)
eventfd_signal

The interrupt is now stuck on until another interrupt is injected.

Well, the EOI somehow happened here before the interrupt, so it's a bug somewhere else?

Interrupts can be shared. We also can't guarantee that the guest won't write a bogus EOI to the ioapic. The irq ack notifier doesn't filter on irq source id... I'm not sure it can. I guess if Avi OKs adding another kvm_set_irq under spinlock, that's the best we can do for now.

Why can't a mutex be used instead of a spinlock again?

eventfd_signal calls the inject function from atomic context.

Actually, that's called from a workq. I'll have to switch it back and turn on lockdep to remember why I couldn't sleep there.

I'll try to fix kvm_set_irq so that it returns 0 if the level was already 0. Then you do not need extra state.

--
MST
Re: [PATCH 0/8] KVM: Optimize MMU notifier's THP page invalidation -v4
On Mon, Jul 02, 2012 at 05:52:39PM +0900, Takuya Yoshikawa wrote:

v3->v4: Resolved the trace_kvm_age_page() issue -- patch 6, 7
v2->v3: Fixed intersection calculations. -- patch 3, 8

	Takuya

Takuya Yoshikawa (8):
  KVM: MMU: Use __gfn_to_rmap() to clean up kvm_handle_hva()
  KVM: Introduce hva_to_gfn_memslot() for kvm_handle_hva()
  KVM: MMU: Make kvm_handle_hva() handle range of addresses
  KVM: Introduce kvm_unmap_hva_range() for kvm_mmu_notifier_invalidate_range_start()
  KVM: Separate rmap_pde from kvm_lpage_info->write_count
  KVM: MMU: Add memslot parameter to hva handlers
  KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
  KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()

 arch/powerpc/include/asm/kvm_host.h |   2 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  47 ---
 arch/x86/include/asm/kvm_host.h     |   3 +-
 arch/x86/kvm/mmu.c                  | 107 +++
 arch/x86/kvm/x86.c                  |  11
 include/linux/kvm_host.h            |   8 +++
 virt/kvm/kvm_main.c                 |   3 +-
 7 files changed, 131 insertions(+), 50 deletions(-)

From v2: The new test result was impressively good (see below); THP page invalidation was more than 5 times faster on my x86 machine.

Before:
...
19.852 us | __mmu_notifier_invalidate_range_start();
28.033 us | __mmu_notifier_invalidate_range_start();
19.066 us | __mmu_notifier_invalidate_range_start();
44.715 us | __mmu_notifier_invalidate_range_start();
31.613 us | __mmu_notifier_invalidate_range_start();
20.659 us | __mmu_notifier_invalidate_range_start();
19.979 us | __mmu_notifier_invalidate_range_start();
20.416 us | __mmu_notifier_invalidate_range_start();
20.632 us | __mmu_notifier_invalidate_range_start();
22.316 us | __mmu_notifier_invalidate_range_start();
...

After:
...
4.089 us | __mmu_notifier_invalidate_range_start();
4.096 us | __mmu_notifier_invalidate_range_start();
3.560 us | __mmu_notifier_invalidate_range_start();
3.376 us | __mmu_notifier_invalidate_range_start();
3.772 us | __mmu_notifier_invalidate_range_start();
3.353 us | __mmu_notifier_invalidate_range_start();
3.332 us | __mmu_notifier_invalidate_range_start();
3.332 us | __mmu_notifier_invalidate_range_start();
3.332 us | __mmu_notifier_invalidate_range_start();
3.337 us | __mmu_notifier_invalidate_range_start();
...

Applied, thanks.
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On Jul 18, 2012, at 10:17 AM, Michael S. Tsirkin wrote:

<snip>

You do have to pay very close attention to some things, however. Don't change the device identity in any way - even version information - otherwise a Windows initiator will blue-screen. I made that mistake myself, so I remember it well. It seemed like such an innocent change. I don't recall there being any issue with adding commands, and we did do that on occasion.

How about removing commands?

Good question. With the storage system I am familiar with, that would only be a risk if firmware were downgraded. Downgrading would never have been recommended. I am sure that if something like persistent reserve suddenly went away, it would cause big trouble for some initiators.

--
Mark Rustad, LAN Access Division, Intel Corporation
Re: [PATCH] x86, hyper: fix build with !CONFIG_KVM_GUEST
On Wed, Jul 18, 2012 at 11:48:50AM +0300, Avi Kivity wrote:

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kernel/cpu/hypervisor.c | 2 ++
 1 file changed, 2 insertions(+)

Applied to next, thanks.
Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
On Wed, 2012-07-18 at 13:13 -0600, Alex Williamson wrote:

On Wed, 2012-07-18 at 13:07 -0600, Alex Williamson wrote:

On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:

On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:

Back to the original point though: the current situation is that calling kvm_set_irq() under a spinlock is not worse for scalability than calling it not under one.

Yes. Still, the specific use can just use an atomic flag; lock+bool is not needed, and we won't need to undo it later.

Actually, no, replacing it with an atomic is racy.

CPU0 (inject)                     CPU1 (EOI)
atomic_cmpxchg(asserted, 0, 1)
                                  atomic_cmpxchg(asserted, 1, 0)
                                  kvm_set_irq(0)
kvm_set_irq(1)
eventfd_signal

The interrupt is now stuck on until another interrupt is injected.

Well, the EOI somehow happened here before the interrupt, so it's a bug somewhere else?

Interrupts can be shared. We also can't guarantee that the guest won't write a bogus EOI to the ioapic. The irq ack notifier doesn't filter on irq source id... I'm not sure it can. I guess if Avi OKs adding another kvm_set_irq under spinlock, that's the best we can do for now.

Why can't a mutex be used instead of a spinlock again?

eventfd_signal calls the inject function from atomic context.

Actually, that's called from a workq. I'll have to switch it back and turn on lockdep to remember why I couldn't sleep there.

Switching to a mutex results in:

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 30025, name: qemu-system-x86
INFO: lockdep is turned off.
Pid: 30025, comm: qemu-system-x86 Not tainted 3.5.0-rc4+ #109
Call Trace:
 [81088425] __might_sleep+0xf5/0x130
 [81564c6f] mutex_lock_nested+0x2f/0x60
 [a07db7d5] eoifd_event+0x25/0x70 [kvm]
 [a07daea4] kvm_notify_acked_irq+0xa4/0x140 [kvm]
 [a07dae2a] ? kvm_notify_acked_irq+0x2a/0x140 [kvm]
 [a07d9bb4] kvm_ioapic_update_eoi+0x84/0xf0 [kvm]
 [a0806c43] apic_set_eoi+0x123/0x130 [kvm]
 [a0806fd8] apic_reg_write+0x388/0x670 [kvm]
 [a07eb03c] ? vcpu_enter_guest+0x32c/0x740 [kvm]
 [a0807481] kvm_lapic_set_eoi+0x21/0x30 [kvm]
 [a04ba3f9] handle_apic_access+0x69/0x80 [kvm_intel]
 [a04ba02a] vmx_handle_exit+0xaa/0x260 [kvm_intel]
Re: [PATCH] KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended
On Thu, Jul 12, 2012 at 06:35:09PM +0900, Takuya Yoshikawa wrote:

On Thu, 5 Jul 2012 23:05:46 +0900, Takuya Yoshikawa takuya.yoshik...@gmail.com wrote:

On Thu, 5 Jul 2012 14:50:00 +0300, Gleb Natapov g...@redhat.com wrote:

Note that the if (!nr_to_scan--) check is removed, since we do not try to free mmu pages from more than one VM. IIRC it was proposed in the past that we should iterate over the vm list until freeing something eventually, but Avi was against it.

I think the probability of a VM with kvm->arch.n_used_mmu_pages == 0 is low, so it looks OK to me to drop nr_to_scan. Since our batch size is 128, the minimum positive @nr_to_scan, it's almost impossible to see the effect of the check.

Thinking more about this: I think freeing mmu pages via shrink_slab() is problematic. For example, if we do

  # echo 2 > /proc/sys/vm/drop_caches

on the host, some mmu pages will be freed. This is probably not what most people expect. Although this patch is needed to care about shadow paging's extreme mmu page usage, we should do something better in the future.

What I think reasonable is not treating all mmu pages as freeable:
- determine some base number of mmu pages: base_mmu_pages
- return (total_mmu_pages - base_mmu_pages) to the caller
  * We may use n_max_mmu_pages for calculating this base number.

By doing so, we can avoid freeing mmu pages, especially when EPT/NPT is on.

Takuya,

Can't understand; can you please expand more clearly? TIA
Re: [PATCH] qemu kvm: Recognize PCID feature
On Wed, Jul 18, 2012 at 11:29:36AM +0200, Jan Kiszka wrote:

On 2012-07-18 10:44, Mao, Junjie wrote:

Hi, Avi. Any comments on this patch? :)

Always include qemu-devel when you are changing QEMU; qemu-kvm is just staging for the latter. This patch can actually go into upstream directly, maybe even via qemu-trivial, as it just makes that flag selectable.

Jan

Agreed, CCing Stefan.
Re: Latest Centos6.3 kernel, rebooting from inside centos6.3 VM - gets stuck on seabios/grub loop - previous kernels fine
On Wed, Jul 11, 2012 at 10:54:34PM +0100, Morgan Cox wrote:

Hi. I have an Ubuntu 12.04 KVM server. In CentOS 6 VMs, when I install the latest kernel for CentOS 6.3 (2.6.32-279.1.1.el6), rebooting from inside the VM gets stuck in a loop between seabios/grub. If I use virsh/virt-manager to reboot, it's fine; this occurs only from inside a CentOS 6 VM with the latest CentOS kernel, so I am 100% sure it is connected with the latest kernel (or at least the 6.3 updates).

As a test I installed CentOS 6.2 - this was 100% fine *until* I did a yum update, then I got the same issue. If I use the older kernel with all the rest of the CentOS 6.3 updates, the issue does not occur either (so it 'must'? be related to the kernel update). How can this be fixed?

Morgan,

What is the last working kernel RPM version? Can you disable kvmclock? (By appending no-kvmclock to the end of the kernel line of the 2.6.32-279.1.1.el6 entry in /boot/grub/menu.lst.)
Re: [RFC-v3 3/4] vhost: Add vhost_scsi specific defines
On Wed, 2012-07-18 at 16:05 +0300, Michael S. Tsirkin wrote: On Wed, Jul 18, 2012 at 12:59:31AM +, Nicholas A. Bellinger wrote: From: Nicholas Bellinger n...@risingtidesystems.com This patch adds the initial vhost_scsi_ioctl() callers for VHOST_SCSI_SET_ENDPOINT and VHOST_SCSI_CLEAR_ENDPOINT respectively, and also adds struct vhost_vring_target that is used by tcm_vhost code when locating target ports during qemu setup. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com Cc: Zhi Yong Wu wu...@cn.ibm.com Cc: Michael S. Tsirkin m...@redhat.com Cc: Paolo Bonzini pbonz...@redhat.com, Signed-off-by: Nicholas A. Bellinger n...@risingtidesystems.com --- include/linux/vhost.h |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/include/linux/vhost.h b/include/linux/vhost.h index e847f1e..33b313b 100644 --- a/include/linux/vhost.h +++ b/include/linux/vhost.h @@ -24,7 +24,11 @@ struct vhost_vring_state { struct vhost_vring_file { unsigned int index; int fd; /* Pass -1 to unbind from file. */ +}; +struct vhost_vring_target { Can this be renamed vhost_scsi_target? Done + unsigned char vhost_wwpn[224]; 224? I am guessing ISCSI_NAME_LEN from include/scsi/iscsi_proto.h? Unfortunately we can't include iscsi_proto.h here as it is not exported to users. But let's add a comment for now. This is actually from target/target_core_base.h:TRANSPORT_IQN_LEN. Fixing this up now.. + unsigned short vhost_tpgt; }; struct vhost_vring_addr { @@ -121,6 +125,11 @@ struct vhost_memory { * device. This can be used to stop the ring (e.g. for migration). */ #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file) +/* VHOST_SCSI specific defines */ + +#define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct vhost_vring_target) +#define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct vhost_vring_target) + /* Feature bits */ /* Log all write descriptors. Can be changed while device is active. 
*/ Can these go into appropriate ifdef CONFIG_TCP_VHOST please? M, I don't think we can do that with CONFIG_TCM_VHOST=m, or at least not with the following patch diff --git a/include/linux/vhost.h b/include/linux/vhost.h index 33b313b..e4b1ee3 100644 --- a/include/linux/vhost.h +++ b/include/linux/vhost.h @@ -125,10 +126,14 @@ struct vhost_memory { * device. This can be used to stop the ring (e.g. for migration). */ #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file) +#ifdef CONFIG_TCM_VHOST + /* VHOST_SCSI specific defines */ -#define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct vhost_vring_target) -#define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct vhost_vring_target) +# define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct vhost_scsi_target) +# define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct vhost_scsi_target) + +#endif -- drivers/vhost/tcm/tcm_vhost.c: In function ‘vhost_scsi_ioctl’: drivers/vhost/tcm/tcm_vhost.c:970: error: ‘VHOST_SCSI_SET_ENDPOINT’ undeclared (first use in this function) drivers/vhost/tcm/tcm_vhost.c:970: error: (Each undeclared identifier is reported only once drivers/vhost/tcm/tcm_vhost.c:970: error: for each function it appears in.) drivers/vhost/tcm/tcm_vhost.c:975: error: ‘VHOST_SCSI_CLEAR_ENDPOINT’ undeclared (first use in this function) make[3]: *** [drivers/vhost/tcm/tcm_vhost.o] Error 1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC-v3 4/4] tcm_vhost: Initial merge for vhost level target fabric driver
On Wed, 2012-07-18 at 19:09 +0300, Michael S. Tsirkin wrote: On Wed, Jul 18, 2012 at 12:59:32AM +, Nicholas A. Bellinger wrote: SNIP Changelog v2 - v3: Unlock on error in tcm_vhost_drop_nexus() (DanC) Fix strlen() doesn't count the terminator (DanC) Call kfree() on an error path (DanC) Convert tcm_vhost_write_pending to use target_execute_cmd (hch + nab) Fix another strlen() off by one in tcm_vhost_make_tport (DanC) Add option under drivers/staging/Kconfig, and move to drivers/vhost/tcm/ as requested by MST (nab) --- drivers/staging/Kconfig |2 + drivers/vhost/Makefile|2 + drivers/vhost/tcm/Kconfig |6 + drivers/vhost/tcm/Makefile|1 + drivers/vhost/tcm/tcm_vhost.c | 1611 + drivers/vhost/tcm/tcm_vhost.h | 74 ++ 6 files changed, 1696 insertions(+), 0 deletions(-) create mode 100644 drivers/vhost/tcm/Kconfig create mode 100644 drivers/vhost/tcm/Makefile create mode 100644 drivers/vhost/tcm/tcm_vhost.c create mode 100644 drivers/vhost/tcm/tcm_vhost.h Really sorry about making you run around like that, I did not mean moving all of tcm to a directory, just adding tcm/Kconfig or adding drivers/vhost/Kconfig.tcm because eventually it's easier to keep it all together in one place. Er, apologies for the slight mis-understanding here.. Moving back now + fixing up the Kbuild bits. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC-v3 3/4] vhost: Add vhost_scsi specific defines
On Wed, Jul 18, 2012 at 02:17:05PM -0700, Nicholas A. Bellinger wrote: On Wed, 2012-07-18 at 16:05 +0300, Michael S. Tsirkin wrote: On Wed, Jul 18, 2012 at 12:59:31AM +, Nicholas A. Bellinger wrote: From: Nicholas Bellinger n...@risingtidesystems.com This patch adds the initial vhost_scsi_ioctl() callers for VHOST_SCSI_SET_ENDPOINT and VHOST_SCSI_CLEAR_ENDPOINT respectively, and also adds struct vhost_vring_target that is used by tcm_vhost code when locating target ports during qemu setup. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com Cc: Zhi Yong Wu wu...@cn.ibm.com Cc: Michael S. Tsirkin m...@redhat.com Cc: Paolo Bonzini pbonz...@redhat.com, Signed-off-by: Nicholas A. Bellinger n...@risingtidesystems.com --- include/linux/vhost.h |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/include/linux/vhost.h b/include/linux/vhost.h index e847f1e..33b313b 100644 --- a/include/linux/vhost.h +++ b/include/linux/vhost.h @@ -24,7 +24,11 @@ struct vhost_vring_state { struct vhost_vring_file { unsigned int index; int fd; /* Pass -1 to unbind from file. */ +}; +struct vhost_vring_target { Can this be renamed vhost_scsi_target? Done + unsigned char vhost_wwpn[224]; 224? I am guessing ISCSI_NAME_LEN from include/scsi/iscsi_proto.h? Unfortunately we can't include iscsi_proto.h here as it is not exported to users. But let's add a comment for now. This is actually from target/target_core_base.h:TRANSPORT_IQN_LEN. Fixing this up now.. + unsigned short vhost_tpgt; }; struct vhost_vring_addr { @@ -121,6 +125,11 @@ struct vhost_memory { * device. This can be used to stop the ring (e.g. for migration). */ #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file) +/* VHOST_SCSI specific defines */ + +#define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct vhost_vring_target) +#define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct vhost_vring_target) + /* Feature bits */ /* Log all write descriptors. 
Can be changed while device is active. */ Can these go into appropriate ifdef CONFIG_TCP_VHOST please? M, I don't think we can do that with CONFIG_TCM_VHOST=m, or at least not with the following patch diff --git a/include/linux/vhost.h b/include/linux/vhost.h index 33b313b..e4b1ee3 100644 --- a/include/linux/vhost.h +++ b/include/linux/vhost.h @@ -125,10 +126,14 @@ struct vhost_memory { * device. This can be used to stop the ring (e.g. for migration). */ #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file) +#ifdef CONFIG_TCM_VHOST + /* VHOST_SCSI specific defines */ -#define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct vhost_vring_target) -#define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct vhost_vring_target) +# define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct vhost_scsi_target) +# define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct vhost_scsi_target) + +#endif -- drivers/vhost/tcm/tcm_vhost.c: In function ‘vhost_scsi_ioctl’: drivers/vhost/tcm/tcm_vhost.c:970: error: ‘VHOST_SCSI_SET_ENDPOINT’ undeclared (first use in this function) drivers/vhost/tcm/tcm_vhost.c:970: error: (Each undeclared identifier is reported only once drivers/vhost/tcm/tcm_vhost.c:970: error: for each function it appears in.) drivers/vhost/tcm/tcm_vhost.c:975: error: ‘VHOST_SCSI_CLEAR_ENDPOINT’ undeclared (first use in this function) make[3]: *** [drivers/vhost/tcm/tcm_vhost.o] Error 1 Maybe ifdefs only work for booleans? If yes you can probably add a boolean and select it? What I want to prevent is exposing tcm stuff in the header if it is configured off. I'll play with it tomorrow. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
On Wed, Jul 18, 2012 at 02:28:34PM -0600, Alex Williamson wrote: turn on lockdep to remember why I couldn't sleep there. Switching to a mutex results in:

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 30025, name: qemu-system-x86
INFO: lockdep is turned off.
Pid: 30025, comm: qemu-system-x86 Not tainted 3.5.0-rc4+ #109
Call Trace:
 [81088425] __might_sleep+0xf5/0x130
 [81564c6f] mutex_lock_nested+0x2f/0x60
 [a07db7d5] eoifd_event+0x25/0x70 [kvm]
 [a07daea4] kvm_notify_acked_irq+0xa4/0x140 [kvm]
 [a07dae2a] ? kvm_notify_acked_irq+0x2a/0x140 [kvm]
 [a07d9bb4] kvm_ioapic_update_eoi+0x84/0xf0 [kvm]
 [a0806c43] apic_set_eoi+0x123/0x130 [kvm]
 [a0806fd8] apic_reg_write+0x388/0x670 [kvm]
 [a07eb03c] ? vcpu_enter_guest+0x32c/0x740 [kvm]
 [a0807481] kvm_lapic_set_eoi+0x21/0x30 [kvm]
 [a04ba3f9] handle_apic_access+0x69/0x80 [kvm_intel]
 [a04ba02a] vmx_handle_exit+0xaa/0x260 [kvm_intel]

It's RCU from ack notifiers, OK.
Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
On Wed, Jul 18, 2012 at 06:23:34PM -0300, Marcelo Tosatti wrote: On Wed, Jul 18, 2012 at 02:28:34PM -0600, Alex Williamson wrote: turn on lockdep to remember why I couldn't sleep there. Switching to a mutex results in:

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 30025, name: qemu-system-x86
INFO: lockdep is turned off.
Pid: 30025, comm: qemu-system-x86 Not tainted 3.5.0-rc4+ #109
Call Trace:
 [81088425] __might_sleep+0xf5/0x130
 [81564c6f] mutex_lock_nested+0x2f/0x60
 [a07db7d5] eoifd_event+0x25/0x70 [kvm]
 [a07daea4] kvm_notify_acked_irq+0xa4/0x140 [kvm]
 [a07dae2a] ? kvm_notify_acked_irq+0x2a/0x140 [kvm]
 [a07d9bb4] kvm_ioapic_update_eoi+0x84/0xf0 [kvm]
 [a0806c43] apic_set_eoi+0x123/0x130 [kvm]
 [a0806fd8] apic_reg_write+0x388/0x670 [kvm]
 [a07eb03c] ? vcpu_enter_guest+0x32c/0x740 [kvm]
 [a0807481] kvm_lapic_set_eoi+0x21/0x30 [kvm]
 [a04ba3f9] handle_apic_access+0x69/0x80 [kvm_intel]
 [a04ba02a] vmx_handle_exit+0xaa/0x260 [kvm_intel]

It's RCU from ack notifiers, OK.

I'm testing a patch that moves all bitmap handling to under the pic/ioapic lock. After this we can teach kvm_set_irq to report when a bit that is cleared/set is already clear/set, without races. And then no tracking will be necessary in irqfd; we can just call kvm_set_irq(..., 0) and look at the return status. -- MST
[PATCH] kvm: fix race with level interrupts
When more than 1 source id is in use for the same GSI, we have the following race related to handling irq_states: CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0. CPU 1 sets bit in irq_states. CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0). Now ioapic thinks the level is 0 but irq_state is not 0. Note that above is valid behaviour if CPU0 and CPU1 are using different source ids. Fix by performing all irq_states bitmap handling under pic/ioapic lock. This also removes the need for atomics with irq_states handling. Reported-by: Gleb Natapov g...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This survives stress testing for me (though I only tried virtio, not device assignment). Avi, Marcelo, though we have not observed this in the field, it's a bugfix so probably 3.5 material? I assume yes so the patch is against 3.5-rc7. Also stable? It's a very old bug. arch/x86/include/asm/kvm_host.h | 15 ++- arch/x86/kvm/i8259.c| 14 -- virt/kvm/ioapic.c | 13 - virt/kvm/ioapic.h | 4 +++- virt/kvm/irq_comm.c | 31 --- 5 files changed, 45 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index db7c1f2..fe6e9f1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -int kvm_pic_set_irq(void *opaque, int irq, int level); +static inline int kvm_irq_line_state(unsigned long *irq_state, +int irq_source_id, int level) +{ + /* Logical OR for level trig interrupt */ + if (level) + __set_bit(irq_source_id, irq_state); + else + __clear_bit(irq_source_id, irq_state); + + return !!(*irq_state); +} + +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); +void kvm_pic_clear_all(struct kvm_pic *pic, int 
irq_source_id);

 void kvm_inject_nmi(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 81cf4fa..95b504b 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s)
 	pic_unlock(s);
 }
 
-int kvm_pic_set_irq(void *opaque, int irq, int level)
+int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
 {
-	struct kvm_pic *s = opaque;
 	int ret = -1;
+	int level;
 
 	pic_lock(s);
 	if (irq >= 0 && irq < PIC_NUM_PINS) {
+		level = kvm_irq_line_state(&s->irq_states[irq], src_id, l);
 		ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
 		pic_update_irq(s);
 		trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
@@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
 	return ret;
 }
 
+void kvm_pic_clear_all(struct kvm_pic *s, int src_id)
+{
+	int i;
+	pic_lock(s);
+	for (i = 0; i < PIC_NUM_PINS; i++)
+		__clear_bit(src_id, &s->irq_states[i]);
+	pic_unlock(s);
+}
+
 /*
  * acknowledge interrupt 'irq'
  */
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 26fd54d..268b332 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
 	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe);
 }
 
-int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
+int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l)
 {
 	u32 old_irr;
 	u32 mask = 1 << irq;
 	union kvm_ioapic_redirect_entry entry;
 	int ret = 1;
+	int level;
 
 	spin_lock(&ioapic->lock);
 	old_irr = ioapic->irr;
 	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
 		entry = ioapic->redirtbl[irq];
+		level = kvm_irq_line_state(&ioapic->irq_states[irq], src_id, l);
 		level ^= entry.fields.polarity;
 		if (!level)
 			ioapic->irr &= ~mask;
@@ -221,6 +223,15 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
 	return ret;
 }
 
+void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int src_id)
+{
+	int i;
+	spin_lock(&ioapic->lock);
+	for (i = 0; i <
KVM_IOAPIC_NUM_PINS; i++)
+		__clear_bit(src_id, &ioapic->irq_states[i]);
+	spin_unlock(&ioapic->lock);
+}
+
 static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,
 				    int trigger_mode)
 {
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 32872a0..a30abfe 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -74,7 +74,9 @@ void
Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote: On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote: _Seems_ racy, or _is_ racy? Please identify the race.

Look at this:

static inline int kvm_irq_line_state(unsigned long *irq_state, int irq_source_id, int level) { /* Logical OR for level trig interrupt */ if (level) set_bit(irq_source_id, irq_state); else clear_bit(irq_source_id, irq_state); return !!(*irq_state); }

Now: if another CPU changes some other bit after the atomic change, it looks like !!(*irq_state) might return a stale value. CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1. If CPU 0 sees a stale value now, it will return 0 here and the interrupt will get cleared.

This will hardly happen on x86, especially since the bit is set with a serialized instruction. But there is actually a race here: CPU 0 clears bit 0. CPU 0 reads irq_state as 0. CPU 1 sets level to 1. CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0). Now the ioapic thinks the level is 0 but irq_state is not 0. This untested and un-compiled patch should fix it.

Getting rid of atomics completely makes me more comfortable, and by moving all bitmap handling to under the pic/ioapic lock we can do just that. I just tested and posted a patch that fixes the race in this way. Could you take a look pls? -- MST
[PATCH RFC untested] kvm_set_irq: report coalesced for clear
This creates a way to detect when kvm_set_irq(...,0) was run twice with the same source id by returning 0 in this case. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This is on top of my bugfix patch. Uncompiled and untested. Alex, I think something like this patch will make it possible for you to simply do if (kvm_set_irq(, 0)) eventfd_signal() without need for further tricks. arch/x86/include/asm/kvm_host.h | 9 - arch/x86/kvm/i8259.c| 11 +++ virt/kvm/ioapic.c | 17 ++--- 3 files changed, 21 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fe6e9f1..6f5ed0a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,16 +802,15 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -static inline int kvm_irq_line_state(unsigned long *irq_state, +/* Tweak line state. Return true if state was changed. 
*/ +static inline bool kvm_irq_line_state(unsigned long *irq_state, int irq_source_id, int level) { /* Logical OR for level trig interrupt */ if (level) - __set_bit(irq_source_id, irq_state); + return !__test_and_set_bit(irq_source_id, irq_state); else - __clear_bit(irq_source_id, irq_state); - - return !!(*irq_state); + return __test_and_clear_bit(irq_source_id, irq_state); } int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 95b504b..44e3635 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -190,12 +190,15 @@ void kvm_pic_update_irq(struct kvm_pic *s) int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l) { - int ret = -1; - int level; + int ret, level; pic_lock(s); - if (irq = 0 irq PIC_NUM_PINS) { - level = kvm_irq_line_state(s-irq_states[irq], src_id, l); + if (irq 0 || irq = PIC_NUM_PINS) { + ret = -1; + } else if (!kvm_irq_line_state(s-irq_states[irq], src_id, l)) { + ret = 0; + } else { + level = !!s-irq_states[irq]; ret = pic_set_irq1(s-pics[irq 3], irq 7, level); pic_update_irq(s); trace_kvm_pic_set_irq(irq 3, irq 7, s-pics[irq 3].elcr, diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 268b332..edc9445 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -196,18 +196,21 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l) u32 old_irr; u32 mask = 1 irq; union kvm_ioapic_redirect_entry entry; - int ret = 1; - int level; + int ret, level; spin_lock(ioapic-lock); old_irr = ioapic-irr; - if (irq = 0 irq IOAPIC_NUM_PINS) { + if (irq 0 || irq = IOAPIC_NUM_PINS) { + ret = -1; + } else if (!kvm_irq_line_state(ioapic-irq_states[irq], src_id, l)) { + ret = 0; + } else { entry = ioapic-redirtbl[irq]; - level = kvm_irq_line_state(ioapic-irq_states[irq], src_id, l); - level ^= entry.fields.polarity; - if (!level) + level = (!!ioapic-irq_states[irq]) ^ entry.fields.polarity; + if (!level) { ioapic-irr = ~mask; - 
else { + ret = 1; + } else { int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG); ioapic-irr |= mask; if ((edge old_irr != ioapic-irr) || -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: fix race with level interrupts
On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote: When more than 1 source id is in use for the same GSI, we have the following race related to handling irq_states: CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0. CPU 1 sets bit in irq_states. CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0). Now ioapic thinks the level is 0 but irq_state is not 0. Note that above is valid behaviour if CPU0 and CPU1 are using different source ids. Fix by performing all irq_states bitmap handling under pic/ioapic lock. This also removes the need for atomics with irq_states handling. Reported-by: Gleb Natapov g...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This survives stress testing for me (though I only tried virtio, not device assignment). Avi, Marcelo, though we have not observed this in the field, it's a bugfix so probably 3.5 material? I assume yes so the patch is against 3.5-rc7. Also stable? It's a very old bug. arch/x86/include/asm/kvm_host.h | 15 ++- arch/x86/kvm/i8259.c| 14 -- virt/kvm/ioapic.c | 13 - virt/kvm/ioapic.h | 4 +++- virt/kvm/irq_comm.c | 31 --- 5 files changed, 45 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index db7c1f2..fe6e9f1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -int kvm_pic_set_irq(void *opaque, int irq, int level); +static inline int kvm_irq_line_state(unsigned long *irq_state, + int irq_source_id, int level) +{ + /* Logical OR for level trig interrupt */ + if (level) + __set_bit(irq_source_id, irq_state); + else + __clear_bit(irq_source_id, irq_state); + + return !!(*irq_state); +} + +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, 
int level); +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); void kvm_inject_nmi(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 81cf4fa..95b504b 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s) pic_unlock(s); } -int kvm_pic_set_irq(void *opaque, int irq, int level) +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l) Please use irq_source_id and keep level. I hate reviewing code where I have to differentiate 'l' vs '1'. Same below, mixing src_id and irq_source_id in various places. { - struct kvm_pic *s = opaque; int ret = -1; + int level; And you'd avoid this variable pic_lock(s); if (irq = 0 irq PIC_NUM_PINS) { + level = kvm_irq_line_state(s-irq_states[irq], src_id, l); ret = pic_set_irq1(s-pics[irq 3], irq 7, level); pic_update_irq(s); trace_kvm_pic_set_irq(irq 3, irq 7, s-pics[irq 3].elcr, @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level) return ret; } +void kvm_pic_clear_all(struct kvm_pic *s, int src_id) +{ + int i; + pic_lock(s); + for (i = 0; i PIC_NUM_PINS; i++) + __clear_bit(src_id, s-irq_states[i]); + pic_unlock(s); +} + /* * acknowledge interrupt 'irq' */ diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 26fd54d..268b332 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq) return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe); } -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level) +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l) { u32 old_irr; u32 mask = 1 irq; union kvm_ioapic_redirect_entry entry; int ret = 1; + int level; spin_lock(ioapic-lock); old_irr = ioapic-irr; if (irq = 0 irq IOAPIC_NUM_PINS) { entry = ioapic-redirtbl[irq]; + level = kvm_irq_line_state(ioapic-irq_states[irq], src_id, l); level ^= entry.fields.polarity; if (!level) 
ioapic-irr = ~mask; @@ -221,6 +223,15 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level) return ret; } +void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int src_id) +{ + int i; + spin_lock(ioapic-lock); + for (i = 0; i KVM_IOAPIC_NUM_PINS; i++) + __clear_bit(src_id, ioapic-irq_states[i]); +
Re: [PATCH] kvm: fix race with level interrupts
On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote: When more than 1 source id is in use for the same GSI, we have the following race related to handling irq_states: CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0. CPU 1 sets bit in irq_states. CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0). Now ioapic thinks the level is 0 but irq_state is not 0. Note that above is valid behaviour if CPU0 and CPU1 are using different source ids. Fix by performing all irq_states bitmap handling under pic/ioapic lock. This also removes the need for atomics with irq_states handling. Reported-by: Gleb Natapov g...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This survives stress testing for me (though I only tried virtio, not device assignment). Avi, Marcelo, though we have not observed this in the field, it's a bugfix so probably 3.5 material? I assume yes so the patch is against 3.5-rc7. Also stable? It's a very old bug. arch/x86/include/asm/kvm_host.h | 15 ++- arch/x86/kvm/i8259.c| 14 -- virt/kvm/ioapic.c | 13 - virt/kvm/ioapic.h | 4 +++- virt/kvm/irq_comm.c | 31 --- 5 files changed, 45 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index db7c1f2..fe6e9f1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -int kvm_pic_set_irq(void *opaque, int irq, int level); +static inline int kvm_irq_line_state(unsigned long *irq_state, + int irq_source_id, int level) This should probably be __kvm_irq_line_state given the calling restrictions. 
+{ + /* Logical OR for level trig interrupt */ + if (level) + __set_bit(irq_source_id, irq_state); + else + __clear_bit(irq_source_id, irq_state); + + return !!(*irq_state); +} + +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); void kvm_inject_nmi(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 81cf4fa..95b504b 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s) pic_unlock(s); } -int kvm_pic_set_irq(void *opaque, int irq, int level) +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l) { - struct kvm_pic *s = opaque; int ret = -1; + int level; pic_lock(s); if (irq = 0 irq PIC_NUM_PINS) { + level = kvm_irq_line_state(s-irq_states[irq], src_id, l); ret = pic_set_irq1(s-pics[irq 3], irq 7, level); pic_update_irq(s); trace_kvm_pic_set_irq(irq 3, irq 7, s-pics[irq 3].elcr, @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level) return ret; } +void kvm_pic_clear_all(struct kvm_pic *s, int src_id) +{ + int i; + pic_lock(s); + for (i = 0; i PIC_NUM_PINS; i++) + __clear_bit(src_id, s-irq_states[i]); + pic_unlock(s); +} + /* * acknowledge interrupt 'irq' */ diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 26fd54d..268b332 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq) return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe); } -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level) +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l) { u32 old_irr; u32 mask = 1 irq; union kvm_ioapic_redirect_entry entry; int ret = 1; + int level; spin_lock(ioapic-lock); old_irr = ioapic-irr; if (irq = 0 irq IOAPIC_NUM_PINS) { entry = ioapic-redirtbl[irq]; + level = kvm_irq_line_state(ioapic-irq_states[irq], src_id, 
l); level ^= entry.fields.polarity; if (!level) ioapic-irr = ~mask; @@ -221,6 +223,15 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level) return ret; } +void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int src_id) +{ + int i; + spin_lock(ioapic-lock); + for (i = 0; i KVM_IOAPIC_NUM_PINS; i++) + __clear_bit(src_id, ioapic-irq_states[i]); + spin_unlock(ioapic-lock); +} + static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,
Re: [PATCH RFC untested] kvm_set_irq: report coalesced for clear
On Thu, 2012-07-19 at 01:11 +0300, Michael S. Tsirkin wrote: This creates a way to detect when kvm_set_irq(...,0) was run twice with the same source id by returning 0 in this case. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This is on top of my bugfix patch. Uncompiled and untested. Alex, I think something like this patch will make it possible for you to simply do if (kvm_set_irq(, 0)) eventfd_signal() without need for further tricks. arch/x86/include/asm/kvm_host.h | 9 - arch/x86/kvm/i8259.c| 11 +++ virt/kvm/ioapic.c | 17 ++--- 3 files changed, 21 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fe6e9f1..6f5ed0a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,16 +802,15 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -static inline int kvm_irq_line_state(unsigned long *irq_state, +/* Tweak line state. Return true if state was changed. 
*/ +static inline bool kvm_irq_line_state(unsigned long *irq_state, int irq_source_id, int level) { /* Logical OR for level trig interrupt */ if (level) - __set_bit(irq_source_id, irq_state); + return !__test_and_set_bit(irq_source_id, irq_state); else - __clear_bit(irq_source_id, irq_state); - - return !!(*irq_state); + return __test_and_clear_bit(irq_source_id, irq_state); } int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 95b504b..44e3635 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -190,12 +190,15 @@ void kvm_pic_update_irq(struct kvm_pic *s) int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l) { - int ret = -1; - int level; + int ret, level; pic_lock(s); - if (irq = 0 irq PIC_NUM_PINS) { - level = kvm_irq_line_state(s-irq_states[irq], src_id, l); + if (irq 0 || irq = PIC_NUM_PINS) { Why test this under lock? Return error before lock, then it becomes int ret = 0; if (irq 0 || irq = PIC_NUM_PINS) return -1; pic_lock(s); if (kvm_irq_line_state(s-irq_states[irq], irq_source_id, level) { level = !!s-irq_states[irq]; ... 
} pic_unlock(s); return ret; + ret = -1; + } else if (!kvm_irq_line_state(&s->irq_states[irq], src_id, l)) { + ret = 0; + } else { + level = !!s->irq_states[irq]; ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level); pic_update_irq(s); trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr, diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 268b332..edc9445 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -196,18 +196,21 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l) u32 old_irr; u32 mask = 1 << irq; union kvm_ioapic_redirect_entry entry; - int ret = 1; - int level; + int ret, level; spin_lock(&ioapic->lock); old_irr = ioapic->irr; - if (irq >= 0 && irq < IOAPIC_NUM_PINS) { + if (irq < 0 || irq >= IOAPIC_NUM_PINS) { Same here + ret = -1; + } else if (!kvm_irq_line_state(&ioapic->irq_states[irq], src_id, l)) { + ret = 0; + } else { entry = ioapic->redirtbl[irq]; - level = kvm_irq_line_state(&ioapic->irq_states[irq], src_id, l); - level ^= entry.fields.polarity; - if (!level) + level = (!!ioapic->irq_states[irq]) ^ entry.fields.polarity; + if (!level) { ioapic->irr &= ~mask; - else { + ret = 1; + } else { int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG); ioapic->irr |= mask; if ((edge && old_irr != ioapic->irr) || -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: fix race with level interrupts
On Wed, Jul 18, 2012 at 04:16:14PM -0600, Alex Williamson wrote: On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote: When more than 1 source id is in use for the same GSI, we have the following race related to handling irq_states: CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0. CPU 1 sets bit in irq_states. CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0). Now ioapic thinks the level is 0 but irq_state is not 0. Note that above is valid behaviour if CPU0 and CPU1 are using different source ids. Fix by performing all irq_states bitmap handling under pic/ioapic lock. This also removes the need for atomics with irq_states handling. Reported-by: Gleb Natapov g...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This survives stress testing for me (though I only tried virtio, not device assignment). Avi, Marcelo, though we have not observed this in the field, it's a bugfix so probably 3.5 material? I assume yes so the patch is against 3.5-rc7. Also stable? It's a very old bug. 
arch/x86/include/asm/kvm_host.h | 15 ++- arch/x86/kvm/i8259.c| 14 -- virt/kvm/ioapic.c | 13 - virt/kvm/ioapic.h | 4 +++- virt/kvm/irq_comm.c | 31 --- 5 files changed, 45 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index db7c1f2..fe6e9f1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -int kvm_pic_set_irq(void *opaque, int irq, int level); +static inline int kvm_irq_line_state(unsigned long *irq_state, +int irq_source_id, int level) +{ + /* Logical OR for level trig interrupt */ + if (level) + __set_bit(irq_source_id, irq_state); + else + __clear_bit(irq_source_id, irq_state); + + return !!(*irq_state); +} + +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); void kvm_inject_nmi(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 81cf4fa..95b504b 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s) pic_unlock(s); } -int kvm_pic_set_irq(void *opaque, int irq, int level) +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l) Please use irq_source_id and keep level. I hate reviewing code where I have to differentiate 'l' vs '1'. l is an illegal variable name? Switch to a different font. Same below, mixing src_id and irq_source_id in various places. irq_source_id is too long, coding style says variable names should be short: LOCAL variable names should be short, and to the point. 
{ - struct kvm_pic *s = opaque; int ret = -1; + int level; And you'd avoid this variable by giving a different meaning to the same variable in different parts of a function? Now *that* is a bad idea. l is level for a specific source, level is the resulting irq value. pic_lock(s); if (irq = 0 irq PIC_NUM_PINS) { + level = kvm_irq_line_state(s-irq_states[irq], src_id, l); ret = pic_set_irq1(s-pics[irq 3], irq 7, level); pic_update_irq(s); trace_kvm_pic_set_irq(irq 3, irq 7, s-pics[irq 3].elcr, @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level) return ret; } +void kvm_pic_clear_all(struct kvm_pic *s, int src_id) +{ + int i; + pic_lock(s); + for (i = 0; i PIC_NUM_PINS; i++) + __clear_bit(src_id, s-irq_states[i]); + pic_unlock(s); +} + /* * acknowledge interrupt 'irq' */ diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 26fd54d..268b332 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq) return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe); } -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level) +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l) { u32 old_irr; u32 mask = 1 irq; union kvm_ioapic_redirect_entry entry; int ret = 1; + int level; spin_lock(ioapic-lock); old_irr = ioapic-irr; if (irq = 0 irq IOAPIC_NUM_PINS) { entry = ioapic-redirtbl[irq]; + level =
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On Wed, 2012-07-18 at 08:42 -0500, Anthony Liguori wrote: On 07/17/2012 04:50 PM, Nicholas A. Bellinger wrote: On Tue, 2012-07-17 at 13:55 -0500, Anthony Liguori wrote: On 07/17/2012 10:05 AM, Michael S. Tsirkin wrote: On Wed, Jul 11, 2012 at 09:15:00PM +, Nicholas A. Bellinger wrote: SNIP But I do think the kernel should carefully consider whether it wants to support an interface like this. This an extremely complicated ABI with a lot of subtle details around state and compatibility. Are you absolutely confident that you can support a userspace application that expects to get exactly the same response from all possible commands in 20 kernel versions from now? Virtualization requires absolutely precise compatibility in terms of bugs and features. This is probably not something the TCM stack has had to consider yet. We most certainly have thought about long term userspace compatibility with TCM. Our userspace code (that's now available in all major distros) is completely forward-compatible with new fabric modules such as tcm_vhost. No update required. I'm not sure we're talking about the same thing when we say compatibility. I'm not talking about the API. I'm talking about the behavior of the commands that tcm_vhost supports. OK, I understand what your getting at now.. If you add support for a new command, you need to provide userspace a way to disable this command. If you change what gets reported for VPD, you need to provide userspace a way to make VPD look like what it did in a previous version. Basically, you need to be able to make a TCM device behave 100% the same as it did in an older version of the kernel. This is unique to virtualization due to live migration. If you migrate from a 3.6 kernel to a 3.8 kernel, you need to make sure that the 3.8 kernel's TCM device behaves exactly like the 3.6 kernel because the guest that is interacting with it does not realize that live migration happened. 
Yes, you can add knobs via configfs to control this behavior, but I think the question is, what's the plan for this? So we already allow for some types of CDBs emulation to be toggled via backend device attributes: root@tifa:/usr/src/target-pending.git# tree /sys/kernel/config/target/core/iblock_2/fioa/attrib/ /sys/kernel/config/target/core/iblock_2/fioa/attrib/ ├── block_size ├── emulate_dpo ├── emulate_fua_read ├── emulate_fua_write ├── emulate_rest_reord ├── emulate_tas ├── emulate_tpu ├── emulate_tpws ├── emulate_ua_intlck_ctrl ├── emulate_write_cache ├── enforce_pr_isids SNIP Some things like SPC-3 persistent reservations + implict/explict ALUA multipath currently can't be disabled, but adding two more backend attributes to disable/enable this logic individual is easy enough to do. So that said, I don't have a problem with adding the necessary device attributes to limit what type of CDBs a backend device is capable of processing. Trying to limiting this per-guest (instead of per-device) is where things get a little more tricky.. BTW, I think this is a good thing to cover in Documentation/vhost/tcm_vhost.txt. I think that's probably the only change that's needed here. Sure, but I'll need to know what else that you'd like to optionally restrict it terms of CDB processing that's not already there.. Thanks for your feedback! --nab -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: fix race with level interrupts
On Wed, Jul 18, 2012 at 04:26:41PM -0600, Alex Williamson wrote: On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote: When more than 1 source id is in use for the same GSI, we have the following race related to handling irq_states: CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0. CPU 1 sets bit in irq_states. CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0). Now ioapic thinks the level is 0 but irq_state is not 0. Note that above is valid behaviour if CPU0 and CPU1 are using different source ids. Fix by performing all irq_states bitmap handling under pic/ioapic lock. This also removes the need for atomics with irq_states handling. Reported-by: Gleb Natapov g...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This survives stress testing for me (though I only tried virtio, not device assignment). Avi, Marcelo, though we have not observed this in the field, it's a bugfix so probably 3.5 material? I assume yes so the patch is against 3.5-rc7. Also stable? It's a very old bug. arch/x86/include/asm/kvm_host.h | 15 ++- arch/x86/kvm/i8259.c| 14 -- virt/kvm/ioapic.c | 13 - virt/kvm/ioapic.h | 4 +++- virt/kvm/irq_comm.c | 31 --- 5 files changed, 45 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index db7c1f2..fe6e9f1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -int kvm_pic_set_irq(void *opaque, int irq, int level); +static inline int kvm_irq_line_state(unsigned long *irq_state, +int irq_source_id, int level) This should probably be __kvm_irq_line_state given the calling restrictions. It's such a trivial helper, do we need to split hairs about this? 
Look it's not a good time for minor coding style nits. 3.5 is imminent, it's about 1am here and I really don't have time to retest today, so we'll release another kernel with a bug. Could you focus on reviewing the functionality and correctness, and leave ideas for better variable naming aside until 3.6? +{ + /* Logical OR for level trig interrupt */ + if (level) + __set_bit(irq_source_id, irq_state); + else + __clear_bit(irq_source_id, irq_state); + + return !!(*irq_state); +} + +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); void kvm_inject_nmi(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 81cf4fa..95b504b 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s) pic_unlock(s); } -int kvm_pic_set_irq(void *opaque, int irq, int level) +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l) { - struct kvm_pic *s = opaque; int ret = -1; + int level; pic_lock(s); if (irq = 0 irq PIC_NUM_PINS) { + level = kvm_irq_line_state(s-irq_states[irq], src_id, l); ret = pic_set_irq1(s-pics[irq 3], irq 7, level); pic_update_irq(s); trace_kvm_pic_set_irq(irq 3, irq 7, s-pics[irq 3].elcr, @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level) return ret; } +void kvm_pic_clear_all(struct kvm_pic *s, int src_id) +{ + int i; + pic_lock(s); + for (i = 0; i PIC_NUM_PINS; i++) + __clear_bit(src_id, s-irq_states[i]); + pic_unlock(s); +} + /* * acknowledge interrupt 'irq' */ diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 26fd54d..268b332 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq) return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe); } -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level) +int kvm_ioapic_set_irq(struct kvm_ioapic 
*ioapic, int irq, int src_id, int l) { u32 old_irr; u32 mask = 1 << irq; union kvm_ioapic_redirect_entry entry; int ret = 1; + int level; spin_lock(&ioapic->lock); old_irr = ioapic->irr; if (irq >= 0 && irq < IOAPIC_NUM_PINS) { entry = ioapic->redirtbl[irq]; + level = kvm_irq_line_state(&ioapic->irq_states[irq], src_id, l); level ^= entry.fields.polarity; if (!level)
Re: [PATCH RFC untested] kvm_set_irq: report coalesced for clear
On Wed, Jul 18, 2012 at 04:40:38PM -0600, Alex Williamson wrote: On Thu, 2012-07-19 at 01:11 +0300, Michael S. Tsirkin wrote: This creates a way to detect when kvm_set_irq(...,0) was run twice with the same source id by returning 0 in this case. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This is on top of my bugfix patch. Uncompiled and untested. Alex, I think something like this patch will make it possible for you to simply do if (kvm_set_irq(, 0)) eventfd_signal() without need for further tricks. arch/x86/include/asm/kvm_host.h | 9 - arch/x86/kvm/i8259.c| 11 +++ virt/kvm/ioapic.c | 17 ++--- 3 files changed, 21 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fe6e9f1..6f5ed0a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,16 +802,15 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -static inline int kvm_irq_line_state(unsigned long *irq_state, +/* Tweak line state. Return true if state was changed. 
*/ +static inline bool kvm_irq_line_state(unsigned long *irq_state, int irq_source_id, int level) { /* Logical OR for level trig interrupt */ if (level) - __set_bit(irq_source_id, irq_state); + return !__test_and_set_bit(irq_source_id, irq_state); else - __clear_bit(irq_source_id, irq_state); - - return !!(*irq_state); + return __test_and_clear_bit(irq_source_id, irq_state); } int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 95b504b..44e3635 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -190,12 +190,15 @@ void kvm_pic_update_irq(struct kvm_pic *s) int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l) { - int ret = -1; - int level; + int ret, level; pic_lock(s); - if (irq >= 0 && irq < PIC_NUM_PINS) { - level = kvm_irq_line_state(&s->irq_states[irq], src_id, l); + if (irq < 0 || irq >= PIC_NUM_PINS) { Why test this under lock? Return error before lock, then it becomes int ret = 0; if (irq < 0 || irq >= PIC_NUM_PINS) return -1; pic_lock(s); if (kvm_irq_line_state(&s->irq_states[irq], irq_source_id, level)) { level = !!s->irq_states[irq]; ... 
} pic_unlock(s); return ret; + ret = -1; + } else if (!kvm_irq_line_state(&s->irq_states[irq], src_id, l)) { + ret = 0; + } else { + level = !!s->irq_states[irq]; ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level); pic_update_irq(s); trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr, diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 268b332..edc9445 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -196,18 +196,21 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l) u32 old_irr; u32 mask = 1 << irq; union kvm_ioapic_redirect_entry entry; - int ret = 1; - int level; + int ret, level; spin_lock(&ioapic->lock); old_irr = ioapic->irr; - if (irq >= 0 && irq < IOAPIC_NUM_PINS) { + if (irq < 0 || irq >= IOAPIC_NUM_PINS) { Same here + ret = -1; + } else if (!kvm_irq_line_state(&ioapic->irq_states[irq], src_id, l)) { + ret = 0; + } else { entry = ioapic->redirtbl[irq]; - level = kvm_irq_line_state(&ioapic->irq_states[irq], src_id, l); - level ^= entry.fields.polarity; - if (!level) + level = (!!ioapic->irq_states[irq]) ^ entry.fields.polarity; + if (!level) { ioapic->irr &= ~mask; - else { + ret = 1; + } else { int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG); ioapic->irr |= mask; if ((edge && old_irr != ioapic->irr) || Sure, go ahead. I just sent this to show how to address the locking with EOI patches, don't intend to pursue it myself.
Re: [PATCH] kvm: fix race with level interrupts
On Thu, 2012-07-19 at 01:49 +0300, Michael S. Tsirkin wrote: On Wed, Jul 18, 2012 at 04:26:41PM -0600, Alex Williamson wrote: On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote: When more than 1 source id is in use for the same GSI, we have the following race related to handling irq_states: CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0. CPU 1 sets bit in irq_states. CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0). Now ioapic thinks the level is 0 but irq_state is not 0. Note that above is valid behaviour if CPU0 and CPU1 are using different source ids. Fix by performing all irq_states bitmap handling under pic/ioapic lock. This also removes the need for atomics with irq_states handling. Reported-by: Gleb Natapov g...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This survives stress testing for me (though I only tried virtio, not device assignment). Avi, Marcelo, though we have not observed this in the field, it's a bugfix so probably 3.5 material? I assume yes so the patch is against 3.5-rc7. Also stable? It's a very old bug. arch/x86/include/asm/kvm_host.h | 15 ++- arch/x86/kvm/i8259.c| 14 -- virt/kvm/ioapic.c | 13 - virt/kvm/ioapic.h | 4 +++- virt/kvm/irq_comm.c | 31 --- 5 files changed, 45 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index db7c1f2..fe6e9f1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -int kvm_pic_set_irq(void *opaque, int irq, int level); +static inline int kvm_irq_line_state(unsigned long *irq_state, + int irq_source_id, int level) This should probably be __kvm_irq_line_state given the calling restrictions. 
It's such a trivial helper, do we need to split hairs about this? Look it's not a good time for minor coding style nits. 3.5 is imminent, it's about 1am here and I really don't have time to retest today, so we'll release another kernel with a bug. Could you focus on reviewing the functionality and correctness, and leave ideas for better variable naming aside until 3.6? Please get off your high horse, this bug has existed for a long time and nobody has noticed. __ indicates locking and makes people think twice about arbitrarily calling it. Correct naming prevents future bugs, which is something you've been riding me on in other areas... +{ + /* Logical OR for level trig interrupt */ + if (level) + __set_bit(irq_source_id, irq_state); + else + __clear_bit(irq_source_id, irq_state); + + return !!(*irq_state); +} + +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); void kvm_inject_nmi(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 81cf4fa..95b504b 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s) pic_unlock(s); } -int kvm_pic_set_irq(void *opaque, int irq, int level) +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l) { - struct kvm_pic *s = opaque; int ret = -1; + int level; pic_lock(s); if (irq = 0 irq PIC_NUM_PINS) { + level = kvm_irq_line_state(s-irq_states[irq], src_id, l); ret = pic_set_irq1(s-pics[irq 3], irq 7, level); pic_update_irq(s); trace_kvm_pic_set_irq(irq 3, irq 7, s-pics[irq 3].elcr, @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level) return ret; } +void kvm_pic_clear_all(struct kvm_pic *s, int src_id) +{ + int i; + pic_lock(s); + for (i = 0; i PIC_NUM_PINS; i++) + __clear_bit(src_id, s-irq_states[i]); + pic_unlock(s); +} + /* * acknowledge interrupt 'irq' */ diff --git a/virt/kvm/ioapic.c 
b/virt/kvm/ioapic.c index 26fd54d..268b332 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq) return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe); } -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level) +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l) { u32 old_irr;
Re: [PATCH] kvm: fix race with level interrupts
On Thu, 2012-07-19 at 01:44 +0300, Michael S. Tsirkin wrote: On Wed, Jul 18, 2012 at 04:16:14PM -0600, Alex Williamson wrote: On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote: When more than 1 source id is in use for the same GSI, we have the following race related to handling irq_states: CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0. CPU 1 sets bit in irq_states. CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0). Now ioapic thinks the level is 0 but irq_state is not 0. Note that above is valid behaviour if CPU0 and CPU1 are using different source ids. Fix by performing all irq_states bitmap handling under pic/ioapic lock. This also removes the need for atomics with irq_states handling. Reported-by: Gleb Natapov g...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This survives stress testing for me (though I only tried virtio, not device assignment). Avi, Marcelo, though we have not observed this in the field, it's a bugfix so probably 3.5 material? I assume yes so the patch is against 3.5-rc7. Also stable? It's a very old bug. 
arch/x86/include/asm/kvm_host.h | 15 ++- arch/x86/kvm/i8259.c| 14 -- virt/kvm/ioapic.c | 13 - virt/kvm/ioapic.h | 4 +++- virt/kvm/irq_comm.c | 31 --- 5 files changed, 45 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index db7c1f2..fe6e9f1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); -int kvm_pic_set_irq(void *opaque, int irq, int level); +static inline int kvm_irq_line_state(unsigned long *irq_state, + int irq_source_id, int level) +{ + /* Logical OR for level trig interrupt */ + if (level) + __set_bit(irq_source_id, irq_state); + else + __clear_bit(irq_source_id, irq_state); + + return !!(*irq_state); +} + +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); void kvm_inject_nmi(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 81cf4fa..95b504b 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s) pic_unlock(s); } -int kvm_pic_set_irq(void *opaque, int irq, int level) +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l) Please use irq_source_id and keep level. I hate reviewing code where I have to differentiate 'l' vs '1'. l is an illegal variable name? Switch to a different font. WTF Same below, mixing src_id and irq_source_id in various places. irq_source_id is too long, coding style says variable names should be short: LOCAL variable names should be short, and to the point. irq_source_id is used all over the code, so either convert them all of be consistent. 
{ - struct kvm_pic *s = opaque; int ret = -1; + int level; And you'd avoid this variable by giving a different meaning to the same variable in different parts of a function? Now *that* is a bad idea. l is level for a specific source, level is the resulting irq value. Then maybe 'l' isn't a meaningful variable name. pic_lock(s); if (irq >= 0 && irq < PIC_NUM_PINS) { + level = kvm_irq_line_state(&s->irq_states[irq], src_id, l); ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level); pic_update_irq(s); trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr, @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level) return ret; } +void kvm_pic_clear_all(struct kvm_pic *s, int src_id) +{ + int i; + pic_lock(s); + for (i = 0; i < PIC_NUM_PINS; i++) + __clear_bit(src_id, &s->irq_states[i]); + pic_unlock(s); +} + /* * acknowledge interrupt 'irq' */ diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 26fd54d..268b332 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq) return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe); } -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level) +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l) { u32
Re: UIO: missing resource mapping
On Wed, Jul 18, 2012 at 12:40:47PM +0200, Dominic Eschweiler wrote: On Monday, 16 Jul 2012 at 23:58 +0200, Hans J. Koch wrote: Try to hack up a patch to add generic BAR mapping to uio_pci_generic.c and post it for review. Here we go ... Great! I'm a bit under time pressure with my current work ATM. A call to open(/dev/hansjkoch, O_NONBLOCK) only returns -EBUSY. So it'll take me one or two days to review it thoroughly. But at first sight, it is what I had in mind. You'll hear from me soon, thanks for your work! Comments and reviews from others are welcome... Hans Signed-off-by: Dominic Eschweiler eschwei...@fias.uni-frankfurt.de diff --git a/drivers/uio/uio_pci_generic.c b/drivers/uio/uio_pci_generic.c index 0bd08ef..e25991e 100644 --- a/drivers/uio/uio_pci_generic.c +++ b/drivers/uio/uio_pci_generic.c @@ -25,10 +25,12 @@ #include <linux/slab.h> #include <linux/uio_driver.h> -#define DRIVER_VERSION "0.01.0" +#define DRIVER_VERSION "0.02.0" #define DRIVER_AUTHOR "Michael S. Tsirkin m...@redhat.com" #define DRIVER_DESC "Generic UIO driver for PCI 2.3 devices" +#define DRV_NAME "uio_pci_generic" + struct uio_pci_generic_dev { struct uio_info info; struct pci_dev *pdev; @@ -58,6 +60,7 @@ static int __devinit probe(struct pci_dev *pdev, { struct uio_pci_generic_dev *gdev; int err; + int i; err = pci_enable_device(pdev); if (err) { @@ -67,8 +70,7 @@ static int __devinit probe(struct pci_dev *pdev, } if (!pdev->irq) { - dev_warn(&pdev->dev, "No IRQ assigned to device: " - "no support for interrupts?\n"); + dev_warn(&pdev->dev, "No IRQ assigned to device: no support for interrupts?\n"); pci_disable_device(pdev); return -ENODEV; } @@ -91,10 +93,31 @@ static int __devinit probe(struct pci_dev *pdev, gdev->info.handler = irqhandler; gdev->pdev = pdev; + /* request regions */ + err = pci_request_regions(pdev, DRV_NAME); + if (err) { + dev_err(&pdev->dev, "Couldn't get PCI resources, aborting\n"); + return err; + } + + /* create attributes for BAR mappings */ + for (i = 0; i < PCI_NUM_RESOURCES; i++) { + if 
(pdev->resource[i].flags && + (pdev->resource[i].flags & IORESOURCE_MEM)) { + gdev->info.mem[i].addr = pci_resource_start(pdev, i); + gdev->info.mem[i].size = pci_resource_len(pdev, i); + gdev->info.mem[i].internal_addr = NULL; + gdev->info.mem[i].memtype = UIO_MEM_PHYS; + } + } + if (uio_register_device(&pdev->dev, &gdev->info)) goto err_register; pci_set_drvdata(pdev, gdev); + pr_info("UIO_PCI_GENERIC : initialized new device (%x %x)\n", + pdev->vendor, pdev->device); + return 0; err_register: kfree(gdev); @@ -107,17 +130,21 @@ err_verify: static void remove(struct pci_dev *pdev) { struct uio_pci_generic_dev *gdev = pci_get_drvdata(pdev); - uio_unregister_device(&gdev->info); + + pci_release_regions(pdev); pci_disable_device(pdev); kfree(gdev); + + pr_info("UIO_PCI_GENERIC : removed device (%x %x)\n", + pdev->vendor, pdev->device); } static struct pci_driver driver = { - .name = "uio_pci_generic", + .name = DRV_NAME, .id_table = NULL, /* only dynamic id's */ - .probe = probe, - .remove = remove, + .probe= probe, + .remove = remove, }; static int __init init(void) -- Regards, Dominic Frankfurt Institute for Advanced Studies (FIAS) Ruth-Moufang-Straße 1 D-60438 Frankfurt am Main Germany Phone: +49 69 79844114
Re: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation
On 07/18/2012 04:39 AM, Bharat Bhushan wrote:
> This patch adds watchdog emulation to KVM. The emulation is enabled by the
> KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl. A kernel timer is used to emulate the
> hardware watchdog state machine. On watchdog timer expiry, KVM exits to QEMU
> if TCR.WRC is non-zero; QEMU can then reset or shut down the guest, depending
> on how it is configured.
>
> Signed-off-by: Liu Yu yu@freescale.com
> Signed-off-by: Scott Wood scottw...@freescale.com
> Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com

Please put a note before your signoff stating that it's been modified since the previous signoffs, such as:

[bharat.bhus...@freescale.com: reworked patch]

> @@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
>  	u8 osi_needed;
>  	u8 osi_enabled;
>  	u8 papr_enabled;
> +	u8 watchdog_enable;

s/enable;/enabled;/

> +static void arm_next_watchdog(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long nr_jiffies;
> +
> +	nr_jiffies = watchdog_next_timeout(vcpu);
> +	spin_lock(&vcpu->arch.wdt_lock);

Do watchdog_next_timeout() from within the lock.

> +void kvmppc_watchdog_func(unsigned long data)
> +{
> +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> +	u32 tsr, new_tsr;
> +	int final;
> +
> +	do {
> +		new_tsr = tsr = vcpu->arch.tsr;
> +		final = 0;
> +
> +		/* Time out event */
> +		if (tsr & TSR_ENW) {
> +			if (tsr & TSR_WIS) {
> +				final = 1;
> +			} else {
> +				new_tsr = tsr | TSR_WIS;
> +			}

Unnecessary braces on the inner if/else.

> +		} else {
> +			new_tsr = tsr | TSR_ENW;
> +		}
> +	} while (cmpxchg(&vcpu->arch.tsr, tsr, new_tsr) != tsr);
> +
> +	if (new_tsr & TSR_WIS) {
> +		smp_wmb();
> +		kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
> +		kvm_vcpu_kick(vcpu);
> +	}
> +
> +	/*
> +	 * If this is final watchdog expiry and some action is required
> +	 * then exit to userspace.
> +	 */
> +	if (final && (vcpu->arch.tcr & TCR_WRC_MASK)) {
> +		smp_wmb();
> +		kvm_make_request(KVM_REQ_WATCHDOG, vcpu);
> +		kvm_vcpu_kick(vcpu);
> +	}
> +
> +	/*
> +	 * Stop running the watchdog timer after final expiration to
> +	 * prevent the host from being flooded with timers if the
> +	 * guest sets a short period.
> +	 * Timers will resume when TSR/TCR is updated next time.
> +	 */
> +	if (!final)
> +		arm_next_watchdog(vcpu);
> +}
>  static void update_timer_ints(struct kvm_vcpu *vcpu)

Blank line between functions

>  static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu)
> @@ -516,6 +627,13 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
>  		goto out;
>  	}
> +	if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) &&
> +	    vcpu->arch.watchdog_enable) {

Check .watchdog_enabled when you make the request, not here.

> +		kvm_run->exit_reason = KVM_EXIT_WATCHDOG;
> +		ret = 0;
> +		goto out;
> +	}

I think we should check that ENW/WIS are still set. Now that we don't use TSR[WRS], this is the only way QEMU can preemptively clear a pending final expiration after leaving debug halt. If QEMU doesn't do this, and KVM comes back with a watchdog exit, QEMU won't know whether it's a new, legitimate one, or whether it expired during the debug halt. This would be less ugly than QEMU doing a reset-TSR[WIS,ENW]-and-ignore on the first watchdog exit after a debug halt, and letting the expiration happen again if the guest is really stuck.

>  	if (sregs->u.e.update_special & KVM_SREGS_E_UPDATE_TSR) {
> +		u32 old_tsr = vcpu->arch.tsr;
> +		vcpu->arch.tsr = sregs->u.e.tsr;
> +
> +		if ((old_tsr ^ vcpu->arch.tsr) &
> +		    (TSR_ENW | TSR_WIS | TCR_WRC_MASK))
> +			arm_next_watchdog(vcpu);

Do we still do anything with TSR[WRS] at all?

> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 87f4dc8..3c50b81 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -38,9 +38,11 @@
>  int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>  {
> -	return !(v->arch.shared->msr & MSR_WE) ||
> -	       !!(v->arch.pending_exceptions) ||
> -	       v->requests;
> +	bool ret = !(v->arch.shared->msr & MSR_WE) ||
> +		   !!(v->arch.pending_exceptions) ||
> +		   v->requests;
> +
> +	return ret;
>  }

There's no need to change this function anymore.

> +#define KVM_CAP_PPC_WDT 81

s/WDT/WATCHDOG/ -- or better, s/WDT/BOOKE_WATCHDOG/

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation
-----Original Message-----
From: Wood Scott-B07421
Sent: Thursday, July 19, 2012 7:56 AM
To: Bhushan Bharat-R65777
Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Bhushan Bharat-R65777
Subject: Re: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation

> On 07/18/2012 04:39 AM, Bharat Bhushan wrote:
>> This patch adds watchdog emulation to KVM. The emulation is enabled by the
>> KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl. A kernel timer is used to emulate the
>> hardware watchdog state machine. On watchdog timer expiry, KVM exits to QEMU
>> if TCR.WRC is non-zero; QEMU can then reset or shut down the guest, depending
>> on how it is configured.
>>
>> Signed-off-by: Liu Yu yu@freescale.com
>> Signed-off-by: Scott Wood scottw...@freescale.com
>> Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
>
> Please put a note before your signoff stating that it's been modified since the
> previous signoffs, such as:
>
> [bharat.bhus...@freescale.com: reworked patch]
>
>> @@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
>>  	u8 osi_needed;
>>  	u8 osi_enabled;
>>  	u8 papr_enabled;
>> +	u8 watchdog_enable;
>
> s/enable;/enabled;/
>
>> +static void arm_next_watchdog(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long nr_jiffies;
>> +
>> +	nr_jiffies = watchdog_next_timeout(vcpu);
>> +	spin_lock(&vcpu->arch.wdt_lock);
>
> Do watchdog_next_timeout() from within the lock.

Why do you think so? Does the next-timeout calculation need to be protected?

>> +void kvmppc_watchdog_func(unsigned long data)
>> +{
>> +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
>> +	u32 tsr, new_tsr;
>> +	int final;
>> +
>> +	do {
>> +		new_tsr = tsr = vcpu->arch.tsr;
>> +		final = 0;
>> +
>> +		/* Time out event */
>> +		if (tsr & TSR_ENW) {
>> +			if (tsr & TSR_WIS) {
>> +				final = 1;
>> +			} else {
>> +				new_tsr = tsr | TSR_WIS;
>> +			}
>
> Unnecessary braces on the inner if/else.
>
>> +		} else {
>> +			new_tsr = tsr | TSR_ENW;
>> +		}
>> +	} while (cmpxchg(&vcpu->arch.tsr, tsr, new_tsr) != tsr);
>> +
>> +	if (new_tsr & TSR_WIS) {
>> +		smp_wmb();
>> +		kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>> +		kvm_vcpu_kick(vcpu);
>> +	}
>> +
>> +	/*
>> +	 * If this is final watchdog expiry and some action is required
>> +	 * then exit to userspace.
>> +	 */
>> +	if (final && (vcpu->arch.tcr & TCR_WRC_MASK)) {
>> +		smp_wmb();
>> +		kvm_make_request(KVM_REQ_WATCHDOG, vcpu);
>> +		kvm_vcpu_kick(vcpu);
>> +	}
>> +
>> +	/*
>> +	 * Stop running the watchdog timer after final expiration to
>> +	 * prevent the host from being flooded with timers if the
>> +	 * guest sets a short period.
>> +	 * Timers will resume when TSR/TCR is updated next time.
>> +	 */
>> +	if (!final)
>> +		arm_next_watchdog(vcpu);
>> +}
>>  static void update_timer_ints(struct kvm_vcpu *vcpu)
>
> Blank line between functions
>
>>  static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu)
>> @@ -516,6 +627,13 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
>>  		goto out;
>>  	}
>> +	if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) &&
>> +	    vcpu->arch.watchdog_enable) {
>
> Check .watchdog_enabled when you make the request, not here.
>
>> +		kvm_run->exit_reason = KVM_EXIT_WATCHDOG;
>> +		ret = 0;
>> +		goto out;
>> +	}
>
> I think we should check that ENW/WIS are still set. Now that we don't use
> TSR[WRS], this is the only way QEMU can preemptively clear a pending final
> expiration after leaving debug halt. If QEMU doesn't do this, and KVM comes
> back with a watchdog exit, QEMU won't know whether it's a new legitimate one,
> or whether it expired during the debug halt. This would be less ugly than QEMU
> doing a reset-TSR[WIS,ENW]-and-ignore on the first watchdog exit after a debug
> halt, and letting the expiration happen again if the guest is really stuck.

Ahh. What about clearing the KVM_REQ_WATCHDOG request when QEMU clears ENW|WIS (or whenever KVM or QEMU clears ENW|WIS)?

Thanks
-Bharat

>>  	if (sregs->u.e.update_special & KVM_SREGS_E_UPDATE_TSR) {
>> +		u32 old_tsr = vcpu->arch.tsr;
>> +		vcpu->arch.tsr = sregs->u.e.tsr;
>> +
>> +		if ((old_tsr ^ vcpu->arch.tsr) &
>> +		    (TSR_ENW | TSR_WIS | TCR_WRC_MASK))
>> +			arm_next_watchdog(vcpu);
>
> Do we still do anything with TSR[WRS] at all?
>
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 87f4dc8..3c50b81 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -38,9 +38,11 @@
>>  int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>>  {
>> -	return !(v->arch.shared->msr & MSR_WE) ||
>> -	       !!(v->arch.pending_exceptions) ||
>> -	       v->requests;
>> +	bool ret = !(v->arch.shared->msr & MSR_WE) ||
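The lock-free TSR update debated above can be modeled outside the kernel. Below is a hypothetical stand-alone sketch of the compare-and-swap loop in kvmppc_watchdog_func(); the function name is illustrative and the GCC __sync builtin stands in for the kernel's cmpxchg(). It advances the watchdog state machine: first expiry sets ENW, second sets WIS, third is the final expiry.

```c
#include <assert.h>
#include <stdint.h>

#define TSR_ENW 0x80000000u /* Enable Next Watchdog */
#define TSR_WIS 0x40000000u /* WDT Interrupt Status */

/* Advance the emulated watchdog state machine on one timer expiry.
 * Returns 1 when this was the final expiry (ENW and WIS already set). */
static int watchdog_advance_tsr(uint32_t *tsr_p)
{
    uint32_t tsr, new_tsr;
    int final;

    do {
        new_tsr = tsr = *tsr_p; /* snapshot current TSR */
        final = 0;

        if (tsr & TSR_ENW) {
            if (tsr & TSR_WIS)
                final = 1;              /* second expiry with WIS set */
            else
                new_tsr = tsr | TSR_WIS; /* raise watchdog interrupt */
        } else {
            new_tsr = tsr | TSR_ENW;     /* arm the next watchdog */
        }
        /* retry if the guest/host changed TSR concurrently */
    } while (__sync_val_compare_and_swap(tsr_p, tsr, new_tsr) != tsr);

    return final;
}
```

Three successive expiries walk the state from empty to final, mirroring the do/while loop in the patch.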
[PATCH 2/2 v3] KVM: PPC: booke: Add watchdog emulation
This patch adds watchdog emulation to KVM. The emulation is enabled by the KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl. A kernel timer is used to emulate the hardware watchdog state machine. On watchdog timer expiry, KVM exits to QEMU if TCR.WRC is non-zero; QEMU can then reset or shut down the guest, depending on how it is configured.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v3:
- Using KVM_REQ_WATCHDOG for userspace exit.
- TSR changes left for vcpu thread.
- Other review comments on v2

 arch/powerpc/include/asm/kvm_host.h  |   3 +
 arch/powerpc/include/asm/kvm_ppc.h   |   3 +
 arch/powerpc/include/asm/reg_booke.h |   7 ++
 arch/powerpc/kvm/booke.c             | 140 ++
 arch/powerpc/kvm/booke_emulate.c     |   8 ++
 arch/powerpc/kvm/powerpc.c           |  27 ++-
 include/linux/kvm.h                  |   2 +
 include/linux/kvm_host.h             |   1 +
 8 files changed, 187 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 50ea12f..01047f4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -467,6 +467,8 @@ struct kvm_vcpu_arch {
 	ulong fault_esr;
 	ulong queued_dear;
 	ulong queued_esr;
+	spinlock_t wdt_lock;
+	struct timer_list wdt_timer;
 	u32 tlbcfg[4];
 	u32 mmucfg;
 	u32 epr;
@@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
 	u8 osi_needed;
 	u8 osi_enabled;
 	u8 papr_enabled;
+	u8 watchdog_enable;
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0124937..e5cf4b9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -67,6 +67,7 @@
 extern int kvmppc_emulate_mmio(struct kvm_run *run, struct kvm_vcpu *vcpu);
 extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu);
 extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb);
 extern void kvmppc_decrementer_func(unsigned long data);
+extern void kvmppc_watchdog_func(unsigned long data);
 extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu);

 /* Core-specific hooks */
@@ -104,6 +105,8 @@
 extern void kvmppc_core_queue_external(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
 extern void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+extern void kvmppc_core_queue_watchdog(struct kvm_vcpu *vcpu);
+extern void kvmppc_core_dequeue_watchdog(struct kvm_vcpu *vcpu);
 extern int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu, unsigned int op, int *advance);
diff --git a/arch/powerpc/include/asm/reg_booke.h b/arch/powerpc/include/asm/reg_booke.h
index 2d916c4..e07e6af 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -539,6 +539,13 @@
 #define TCR_FIE		0x00800000	/* FIT Interrupt Enable */
 #define TCR_ARE		0x00400000	/* Auto Reload Enable */
+#ifdef CONFIG_E500
+#define TCR_GET_WP(tcr)	((((tcr) & 0xC0000000) >> 30) | \
+			 (((tcr) & 0x1E0000) >> 15))
+#else
+#define TCR_GET_WP(tcr)	(((tcr) & 0xC0000000) >> 30)
+#endif
+
 /* Bit definitions for the TSR. */
 #define TSR_ENW		0x80000000	/* Enable Next Watchdog */
 #define TSR_WIS		0x40000000	/* WDT Interrupt Status */
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index d25a097..9682506 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -206,6 +206,16 @@
 	clear_bit(BOOKE_IRQPRIO_EXTERNAL_LEVEL, &vcpu->arch.pending_exceptions);
 }
+void kvmppc_core_queue_watchdog(struct kvm_vcpu *vcpu)
+{
+	kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_WATCHDOG);
+}
+
+void kvmppc_core_dequeue_watchdog(struct kvm_vcpu *vcpu)
+{
+	clear_bit(BOOKE_IRQPRIO_WATCHDOG, &vcpu->arch.pending_exceptions);
+}
+
 static void set_guest_srr(struct kvm_vcpu *vcpu, unsigned long srr0, u32 srr1)
 {
 #ifdef CONFIG_KVM_BOOKE_HV
@@ -325,6 +335,7 @@
 		msr_mask = MSR_CE | MSR_ME | MSR_DE;
 		int_class = INT_CLASS_NONCRIT;
 		break;
+	case BOOKE_IRQPRIO_WATCHDOG:
 	case BOOKE_IRQPRIO_CRITICAL:
 	case BOOKE_IRQPRIO_DBELL_CRIT:
 		allowed = vcpu->arch.shared->msr & MSR_CE;
@@ -404,12 +415,112 @@
 	return allowed;
 }

+/*
+ * Return the number of jiffies until the next timeout. If the timeout is
+ * longer than the NEXT_TIMER_MAX_DELTA, then return NEXT_TIMER_MAX_DELTA
+ * because the larger value can break the timer APIs.
+ */
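The period semantics behind the next-timeout calculation can be illustrated with a simplified stand-alone model. This is not the kernel's watchdog_next_timeout(); it only assumes the Book E rule that the watchdog fires on the 0-to-1 transition of the timebase bit selected by the WP field, and it returns timebase ticks rather than jiffies. Names and the exact edge convention are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Given the current timebase value and the watchdog period (WP value,
 * where the selected timebase bit is 1ULL << (63 - period)), return how
 * many timebase ticks remain until that bit next transitions 0 -> 1. */
static uint64_t ticks_to_next_expiry(uint64_t tb, unsigned int period)
{
    uint64_t wdt_tb = 1ULL << (63 - period); /* selected timebase bit */
    uint64_t ticks = wdt_tb - (tb & (wdt_tb - 1)); /* to next bit boundary */

    if (tb & wdt_tb)
        ticks += wdt_tb; /* bit already 1: expiry is on the *next* rising edge */

    return ticks;
}
```

A real implementation would convert this tick count to jiffies and clamp it to NEXT_TIMER_MAX_DELTA, as the comment in the patch describes.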
[PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation
This patch adds watchdog emulation to KVM. The emulation is enabled by the KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl. A kernel timer is used to emulate the hardware watchdog state machine. On watchdog timer expiry, KVM exits to QEMU if TCR.WRC is non-zero; QEMU can then reset or shut down the guest, depending on how it is configured.

Signed-off-by: Liu Yu yu@freescale.com
Signed-off-by: Scott Wood scottw...@freescale.com
Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v4:
- in v3 I forgot to add Scott Wood's and Liu Yu's signoffs
v3:
- Using KVM_REQ_WATCHDOG for userspace exit.
- TSR changes left for vcpu thread.
- Other review comments on v2

 arch/powerpc/include/asm/kvm_host.h  |   3 +
 arch/powerpc/include/asm/kvm_ppc.h   |   3 +
 arch/powerpc/include/asm/reg_booke.h |   7 ++
 arch/powerpc/kvm/booke.c             | 140 ++
 arch/powerpc/kvm/booke_emulate.c     |   8 ++
 arch/powerpc/kvm/powerpc.c           |  27 ++-
 include/linux/kvm.h                  |   2 +
 include/linux/kvm_host.h             |   1 +
 8 files changed, 187 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 50ea12f..01047f4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -467,6 +467,8 @@ struct kvm_vcpu_arch {
 	ulong fault_esr;
 	ulong queued_dear;
 	ulong queued_esr;
+	spinlock_t wdt_lock;
+	struct timer_list wdt_timer;
 	u32 tlbcfg[4];
 	u32 mmucfg;
 	u32 epr;
@@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
 	u8 osi_needed;
 	u8 osi_enabled;
 	u8 papr_enabled;
+	u8 watchdog_enable;
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0124937..e5cf4b9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -67,6 +67,7 @@
 extern int kvmppc_emulate_mmio(struct kvm_run *run, struct kvm_vcpu *vcpu);
 extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu);
 extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb);
 extern void kvmppc_decrementer_func(unsigned long data);
+extern void kvmppc_watchdog_func(unsigned long data);
 extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu);

 /* Core-specific hooks */
@@ -104,6 +105,8 @@
 extern void kvmppc_core_queue_external(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
 extern void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+extern void kvmppc_core_queue_watchdog(struct kvm_vcpu *vcpu);
+extern void kvmppc_core_dequeue_watchdog(struct kvm_vcpu *vcpu);
 extern int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu, unsigned int op, int *advance);
diff --git a/arch/powerpc/include/asm/reg_booke.h b/arch/powerpc/include/asm/reg_booke.h
index 2d916c4..e07e6af 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -539,6 +539,13 @@
 #define TCR_FIE		0x00800000	/* FIT Interrupt Enable */
 #define TCR_ARE		0x00400000	/* Auto Reload Enable */
+#ifdef CONFIG_E500
+#define TCR_GET_WP(tcr)	((((tcr) & 0xC0000000) >> 30) | \
+			 (((tcr) & 0x1E0000) >> 15))
+#else
+#define TCR_GET_WP(tcr)	(((tcr) & 0xC0000000) >> 30)
+#endif
+
 /* Bit definitions for the TSR. */
 #define TSR_ENW		0x80000000	/* Enable Next Watchdog */
 #define TSR_WIS		0x40000000	/* WDT Interrupt Status */
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index d25a097..9682506 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -206,6 +206,16 @@
 	clear_bit(BOOKE_IRQPRIO_EXTERNAL_LEVEL, &vcpu->arch.pending_exceptions);
 }
+void kvmppc_core_queue_watchdog(struct kvm_vcpu *vcpu)
+{
+	kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_WATCHDOG);
+}
+
+void kvmppc_core_dequeue_watchdog(struct kvm_vcpu *vcpu)
+{
+	clear_bit(BOOKE_IRQPRIO_WATCHDOG, &vcpu->arch.pending_exceptions);
+}
+
 static void set_guest_srr(struct kvm_vcpu *vcpu, unsigned long srr0, u32 srr1)
 {
 #ifdef CONFIG_KVM_BOOKE_HV
@@ -325,6 +335,7 @@
 		msr_mask = MSR_CE | MSR_ME | MSR_DE;
 		int_class = INT_CLASS_NONCRIT;
 		break;
+	case BOOKE_IRQPRIO_WATCHDOG:
 	case BOOKE_IRQPRIO_CRITICAL:
 	case BOOKE_IRQPRIO_DBELL_CRIT:
 		allowed = vcpu->arch.shared->msr & MSR_CE;
@@ -404,12 +415,112 @@
 	return allowed;
 }

+/*
+ * Return the number of jiffies until the
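The WP extraction done by TCR_GET_WP() can be checked with a small stand-alone test. The masks below assume the usual e500 layout (a 2-bit WP field in the top bits of TCR, extended by a 4-bit WPEXT field), so treat the exact constants as assumptions rather than quotes from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed e500 variant: WP (TCR bits 0:1) forms the low 2 bits of the
 * period value, WPEXT (mask 0x1E0000) supplies the high 4 bits, giving a
 * 6-bit watchdog period selector. */
#define TCR_GET_WP(tcr) ((((tcr) & 0xC0000000u) >> 30) | \
                         (((tcr) & 0x1E0000u) >> 15))
```

With all six bits set, the macro yields 0x3F, i.e. the full 6-bit period range the e500 extension provides.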
[PATCH 3/3 v3] Enable kvm emulated watchdog
Enable the KVM emulated watchdog if KVM supports it (use the capability enablement in the watchdog handler). Watchdog exit (KVM_EXIT_WATCHDOG) handling is also added. The watchdog state machine is cleared whenever the VM state changes to running; this handles cases like return from debug halt.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v3:
- TSR clearing is removed in watchdog exit handling as it is no longer needed.
v2:
- Merged ([PATCH 3/4] Watchdog exit handling support) and ([PATCH 4/4] Enable to use kvm emulated watchdog)
- Clear watchdog state machine when VM state changes to running.

 hw/ppc_booke.c            |  5 +++
 linux-headers/linux/kvm.h |  2 +
 target-ppc/cpu.h          |  1 +
 target-ppc/kvm.c          | 63 +
 4 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/hw/ppc_booke.c b/hw/ppc_booke.c
index 837a5b6..a9fba15 100644
--- a/hw/ppc_booke.c
+++ b/hw/ppc_booke.c
@@ -203,6 +203,11 @@ static void booke_wdt_cb(void *opaque)
                    booke_timer->wdt_timer);
 }
+void ppc_booke_wdt_clear_tsr(CPUPPCState *env, target_ulong tsr)
+{
+    env->spr[SPR_BOOKE_TSR] = tsr & ~(TSR_ENW | TSR_WIS | TSR_WRS_MASK);
+}
+
 void store_booke_tsr(CPUPPCState *env, target_ulong val)
 {
     env->spr[SPR_BOOKE_TSR] &= ~val;
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 4b9e575..64d94da 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -163,6 +163,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_OSI              18
 #define KVM_EXIT_PAPR_HCALL       19
 #define KVM_EXIT_S390_UCONTROL    20
+#define KVM_EXIT_WATCHDOG         21

 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -618,6 +619,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_PPC_WDT 81

 #ifdef KVM_CAP_IRQ_ROUTING
diff --git a/target-ppc/cpu.h b/target-ppc/cpu.h
index ca2fc21..78212b4 100644
--- a/target-ppc/cpu.h
+++ b/target-ppc/cpu.h
@@ -1191,6 +1191,7 @@
 void store_40x_dbcr0 (CPUPPCState *env, uint32_t val);
 void store_40x_sler (CPUPPCState *env, uint32_t val);
 void store_booke_tcr (CPUPPCState *env, target_ulong val);
 void store_booke_tsr (CPUPPCState *env, target_ulong val);
+void ppc_booke_wdt_clear_tsr(CPUPPCState *env, target_ulong tsr);
 void booke206_flush_tlb(CPUPPCState *env, int flags, const int check_iprot);
 target_phys_addr_t booke206_tlb_to_page_size(CPUPPCState *env, ppcmas_tlb_t *tlb);
 int ppcmas_tlb_check(CPUPPCState *env, ppcmas_tlb_t *tlb,
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index b6ef72d..393214e 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -32,6 +32,7 @@
 #include "device_tree.h"
 #include "hw/sysbus.h"
 #include "hw/spapr.h"
+#include "hw/watchdog.h"
 #include "hw/sysbus.h"
 #include "hw/spapr.h"
@@ -60,6 +61,7 @@
 static int cap_booke_sregs;
 static int cap_ppc_smt;
 static int cap_ppc_rma;
 static int cap_spapr_tce;
+static int cap_ppc_wdt;

 /* XXX We have a race condition where we actually have a level triggered
  *     interrupt, but the infrastructure can't expose that yet, so the guest
@@ -86,6 +88,7 @@ int kvm_arch_init(KVMState *s)
     cap_ppc_smt = kvm_check_extension(s, KVM_CAP_PPC_SMT);
     cap_ppc_rma = kvm_check_extension(s, KVM_CAP_PPC_RMA);
     cap_spapr_tce = kvm_check_extension(s, KVM_CAP_SPAPR_TCE);
+    cap_ppc_wdt = kvm_check_extension(s, KVM_CAP_PPC_WDT);
     if (!cap_interrupt_level) {
         fprintf(stderr, "KVM: Couldn't find level irq capability. Expect the
@@ -168,6 +171,25 @@ static int kvm_booke206_tlb_init(CPUPPCState *env)
     return 0;
 }
+static int kvm_wdt_enable(CPUPPCState *env)
+{
+    int ret;
+    struct kvm_enable_cap encap = {};
+
+    if (!kvm_enabled() || !cap_ppc_wdt) {
+        return 0;
+    }
+
+    encap.cap = KVM_CAP_PPC_WDT;
+    ret = kvm_vcpu_ioctl(env, KVM_ENABLE_CAP, &encap);
+    if (ret < 0) {
+        fprintf(stderr, "%s: couldn't enable KVM_CAP_PPC_WDT: %s\n",
+                __func__, strerror(-ret));
+        return ret;
+    }
+
+    return ret;
+}
 #if defined(TARGET_PPC64)
 static void kvm_get_fallback_smmu_info(CPUPPCState *env,
@@ -371,6 +393,33 @@
 static inline void kvm_fixup_page_sizes(CPUPPCState *env)
 #endif /* !defined (TARGET_PPC64) */
+static void cpu_state_change_handler(void *opaque, int running, RunState state)
+{
+    CPUPPCState *env = opaque;
+
+    struct kvm_sregs sregs;
+
+    printf("running = %d, state = %d \n", running, state);
+    if (!running)
+        return;
+
+    /*
+     * Clear the watchdog interrupt condition by clearing TSR.
+     * Similar logic needs to be implemented for watchdog
+     * emulation in qemu.
+     */
+    if (cap_booke_sregs && cap_ppc_wdt) {
+        kvm_vcpu_ioctl(env, KVM_GET_SREGS, &sregs);
+
+        /* Clear TSR.ENW, TSR.WIS and TSR.WRS */
+        ppc_booke_wdt_clear_tsr(env, sregs.u.e.tsr);
Re: [PATCH] msi/msix: added API to set MSI message address and data
On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
> Added (msi|msix)_set_message() functions.
>
> Currently msi_notify()/msix_notify() write to these vectors to signal the guest about an interrupt, so the correct values have to be written there by the guest or QEMU.
>
> For example, a POWER guest never initializes MSI/MSIX vectors; instead it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on POWER we have to initialize the MSI/MSIX message from QEMU.
>
> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

So guests do enable MSI through config space, but do not fill in vectors? Very strange. Are you sure it's not just a guest bug? How does it work for other PCI devices? Can't we just fix guest drivers to program the vectors properly?

Also pls address the comment below. Thanks!

> ---
>  hw/msi.c  | 13 +
>  hw/msi.h  |  1 +
>  hw/msix.c |  9 +
>  hw/msix.h |  2 ++
>  4 files changed, 25 insertions(+)
>
> diff --git a/hw/msi.c b/hw/msi.c
> index 5233204..cc6102f 100644
> --- a/hw/msi.c
> +++ b/hw/msi.c
> @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit)
>      return dev->msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32);
>  }
> +void msi_set_message(PCIDevice *dev, MSIMessage msg)
> +{
> +    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
> +    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
> +
> +    if (msi64bit) {
> +        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
> +    } else {
> +        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
> +    }
> +    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
> +}
> +

Please add documentation. Something like:

/*
 * Special API for POWER to configure the vectors through
 * a side channel. Should never be used by devices.
 */

> bool msi_enabled(const PCIDevice *dev)
> {
>     return msi_present(dev)
> diff --git a/hw/msi.h b/hw/msi.h
> index 75747ab..6ec1f99 100644
> --- a/hw/msi.h
> +++ b/hw/msi.h
> @@ -31,6 +31,7 @@ struct MSIMessage {
>  extern bool msi_supported;
> +void msi_set_message(PCIDevice *dev, MSIMessage msg);
>  bool msi_enabled(const PCIDevice *dev);
>  int msi_init(struct PCIDevice *dev, uint8_t offset, unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);
> diff --git a/hw/msix.c b/hw/msix.c
> index ded3c55..5f7d6d3 100644
> --- a/hw/msix.c
> +++ b/hw/msix.c
> @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
>      return msg;
>  }
> +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
> +{
> +    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
> +
> +    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
> +    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
> +    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
> +}
> +
> /* Add MSI-X capability to the config space for the device. */
> /* Given a bar and its size, add MSI-X table on top of it
>  * and fill MSI-X capability in the config space.
> diff --git a/hw/msix.h b/hw/msix.h
> index 50aee82..26a437e 100644
> --- a/hw/msix.h
> +++ b/hw/msix.h
> @@ -4,6 +4,8 @@
>  #include "qemu-common.h"
>  #include "pci.h"
> +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
> +
>  int msix_init(PCIDevice *pdev, unsigned short nentries, MemoryRegion *bar, unsigned bar_nr, unsigned bar_size);
> --
> 1.7.10
>
> ps. double '-' and git version is an end-of-patch scissor as I read somewhere, cannot recall where exactly :)

On 21/06/12 20:56, Jan Kiszka wrote:
On 2012-06-21 12:50, Alexey Kardashevskiy wrote:
On 21/06/12 20:38, Jan Kiszka wrote:
On 2012-06-21 12:28, Alexey Kardashevskiy wrote:
On 21/06/12 17:39, Jan Kiszka wrote:
On 2012-06-21 09:18, Alexey Kardashevskiy wrote:
agrhhh. sha1 of the patch changed after rebasing :)

Added (msi|msix)_(set|get)_message() functions for whoever might want to use them.

Currently msi_notify()/msix_notify() write to these vectors to signal the guest about an interrupt, so the correct values have to be written there by the guest or QEMU. For example, a POWER guest never initializes MSI/MSIX vectors; instead it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on POWER we have to initialize the MSI/MSIX message from QEMU. As only the set* functions are required for now, the get functions were added, or made public, for symmetry.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  | 29 +
 hw/msi.h  |  2 ++
 hw/msix.c | 11 ++-
 hw/msix.h |  3 +++
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/hw/msi.c b/hw/msi.c
index 5233204..9ad84a4 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -105,6 +105,35 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool
Re: [PATCH] msi/msix: added API to set MSI message address and data
On 18/07/12 22:43, Michael S. Tsirkin wrote:
> On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
>> Added (msi|msix)_set_message() functions.
>>
>> Currently msi_notify()/msix_notify() write to these vectors to signal the guest about an interrupt, so the correct values have to be written there by the guest or QEMU.
>>
>> For example, a POWER guest never initializes MSI/MSIX vectors; instead it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on POWER we have to initialize the MSI/MSIX message from QEMU.
>>
>> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
>
> So guests do enable MSI through config space, but do not fill in vectors?

Yes. msix_capability_init() calls arch_setup_msi_irqs(), which does everything it needs to do (i.e. calls the hypervisor) before msix_capability_init() writes PCI_MSIX_FLAGS_ENABLE to the PCI_MSIX_FLAGS register.

These vectors are PCI bus addresses, and the way they are set is specific to a PCI host controller. I do not see why the current scheme is a bug.

> Very strange. Are you sure it's not just a guest bug? How does it work for other PCI devices?

Did not get the question. It works the same for every PCI device under a POWER guest.

> Can't we just fix guest drivers to program the vectors properly? Also pls address the comment below.

Comment below.

> Thanks!
>
>> ---
>>  hw/msi.c  | 13 +
>>  hw/msi.h  |  1 +
>>  hw/msix.c |  9 +
>>  hw/msix.h |  2 ++
>>  4 files changed, 25 insertions(+)
>>
>> diff --git a/hw/msi.c b/hw/msi.c
>> index 5233204..cc6102f 100644
>> --- a/hw/msi.c
>> +++ b/hw/msi.c
>> @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit)
>>      return dev->msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32);
>>  }
>> +void msi_set_message(PCIDevice *dev, MSIMessage msg)
>> +{
>> +    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
>> +    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
>> +
>> +    if (msi64bit) {
>> +        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
>> +    } else {
>> +        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
>> +    }
>> +    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
>> +}
>> +
>
> Please add documentation. Something like:
>
> /*
>  * Special API for POWER to configure the vectors through
>  * a side channel. Should never be used by devices.
>  */

It is useful for any para-virtualized environment, I believe, is it not? For s390 as well. Of course, only if it supports PCI, which I am not sure it does though :)

>> bool msi_enabled(const PCIDevice *dev)
>> {
>>     return msi_present(dev)
>> diff --git a/hw/msi.h b/hw/msi.h
>> index 75747ab..6ec1f99 100644
>> --- a/hw/msi.h
>> +++ b/hw/msi.h
>> @@ -31,6 +31,7 @@ struct MSIMessage {
>>  extern bool msi_supported;
>> +void msi_set_message(PCIDevice *dev, MSIMessage msg);
>>  bool msi_enabled(const PCIDevice *dev);
>>  int msi_init(struct PCIDevice *dev, uint8_t offset, unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);
>> diff --git a/hw/msix.c b/hw/msix.c
>> index ded3c55..5f7d6d3 100644
>> --- a/hw/msix.c
>> +++ b/hw/msix.c
>> @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
>>      return msg;
>>  }
>> +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
>> +{
>> +    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
>> +
>> +    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
>> +    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
>> +    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
>> +}
>> +
>> /* Add MSI-X capability to the config space for the device. */
>> /* Given a bar and its size, add MSI-X table on top of it
>>  * and fill MSI-X capability in the config space.
>> diff --git a/hw/msix.h b/hw/msix.h
>> index 50aee82..26a437e 100644
>> --- a/hw/msix.h
>> +++ b/hw/msix.h
>> @@ -4,6 +4,8 @@
>>  #include "qemu-common.h"
>>  #include "pci.h"
>> +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
>> +
>>  int msix_init(PCIDevice *pdev, unsigned short nentries, MemoryRegion *bar, unsigned bar_nr, unsigned bar_size);
>> --
>> 1.7.10
>>
>> ps. double '-' and git version is an end-of-patch scissor as I read somewhere, cannot recall where exactly :)

On 21/06/12 20:56, Jan Kiszka wrote:
On 2012-06-21 12:50, Alexey Kardashevskiy wrote:
On 21/06/12 20:38, Jan Kiszka wrote:
On 2012-06-21 12:28, Alexey Kardashevskiy wrote:
On 21/06/12 17:39, Jan Kiszka wrote:
On 2012-06-21 09:18, Alexey Kardashevskiy wrote:
agrhhh. sha1 of the patch changed after rebasing :)

Added (msi|msix)_(set|get)_message() functions for whoever might want to use them. Currently msi_notify()/msix_notify() write to these vectors to signal the guest about an interrupt, so the correct values have to be written there by the guest or QEMU. For example, a POWER guest never initializes MSI/MSIX vectors; instead
Re: [PATCH] msi/msix: added API to set MSI message address and data
On Wed, Jul 18, 2012 at 11:17:12PM +1000, Alexey Kardashevskiy wrote:
> On 18/07/12 22:43, Michael S. Tsirkin wrote:
>> On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
>>> Added (msi|msix)_set_message() functions.
>>>
>>> Currently msi_notify()/msix_notify() write to these vectors to signal the guest about an interrupt, so the correct values have to be written there by the guest or QEMU.
>>>
>>> For example, a POWER guest never initializes MSI/MSIX vectors; instead it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on POWER we have to initialize the MSI/MSIX message from QEMU.
>>>
>>> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
>>
>> So guests do enable MSI through config space, but do not fill in vectors?
>
> Yes. msix_capability_init() calls arch_setup_msi_irqs(), which does everything it needs to do (i.e. calls the hypervisor) before msix_capability_init() writes PCI_MSIX_FLAGS_ENABLE to the PCI_MSIX_FLAGS register.
>
> These vectors are PCI bus addresses, and the way they are set is specific to a PCI host controller. I do not see why the current scheme is a bug.

It won't work with any real PCI device, will it? Real PCI devices expect vectors to be written into their memory.

>> Very strange. Are you sure it's not just a guest bug? How does it work for other PCI devices?
>
> Did not get the question. It works the same for every PCI device under a POWER guest.

I mean for real PCI devices.

>> Can't we just fix guest drivers to program the vectors properly? Also pls address the comment below.
>
> Comment below.
>
>> Thanks!
>>
>>> ---
>>>  hw/msi.c  | 13 +
>>>  hw/msi.h  |  1 +
>>>  hw/msix.c |  9 +
>>>  hw/msix.h |  2 ++
>>>  4 files changed, 25 insertions(+)
>>>
>>> diff --git a/hw/msi.c b/hw/msi.c
>>> index 5233204..cc6102f 100644
>>> --- a/hw/msi.c
>>> +++ b/hw/msi.c
>>> @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit)
>>>      return dev->msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32);
>>>  }
>>> +void msi_set_message(PCIDevice *dev, MSIMessage msg)
>>> +{
>>> +    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
>>> +    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
>>> +
>>> +    if (msi64bit) {
>>> +        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
>>> +    } else {
>>> +        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
>>> +    }
>>> +    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
>>> +}
>>> +
>>
>> Please add documentation. Something like:
>>
>> /*
>>  * Special API for POWER to configure the vectors through
>>  * a side channel. Should never be used by devices.
>>  */
>
> It is useful for any para-virtualized environment, I believe, is it not? For s390 as well. Of course, only if it supports PCI, which I am not sure it does though :)

I expect the normal guest to program the address into the MSI register using config accesses, the same way that it enables MSI/MSIX. Why POWER does it differently I have not yet figured out, but I hope this weirdness is not widespread.

>>> bool msi_enabled(const PCIDevice *dev)
>>> {
>>>     return msi_present(dev)
>>> diff --git a/hw/msi.h b/hw/msi.h
>>> index 75747ab..6ec1f99 100644
>>> --- a/hw/msi.h
>>> +++ b/hw/msi.h
>>> @@ -31,6 +31,7 @@ struct MSIMessage {
>>>  extern bool msi_supported;
>>> +void msi_set_message(PCIDevice *dev, MSIMessage msg);
>>>  bool msi_enabled(const PCIDevice *dev);
>>>  int msi_init(struct PCIDevice *dev, uint8_t offset, unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);
>>> diff --git a/hw/msix.c b/hw/msix.c
>>> index ded3c55..5f7d6d3 100644
>>> --- a/hw/msix.c
>>> +++ b/hw/msix.c
>>> @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
>>>      return msg;
>>>  }
>>> +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
>>> +{
>>> +    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
>>> +
>>> +    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
>>> +    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
>>> +    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
>>> +}
>>> +
>>> /* Add MSI-X capability to the config space for the device. */
>>> /* Given a bar and its size, add MSI-X table on top of it
>>>  * and fill MSI-X capability in the config space.
>>> diff --git a/hw/msix.h b/hw/msix.h
>>> index 50aee82..26a437e 100644
>>> --- a/hw/msix.h
>>> +++ b/hw/msix.h
>>> @@ -4,6 +4,8 @@
>>>  #include "qemu-common.h"
>>>  #include "pci.h"
>>> +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
>>> +
>>>  int msix_init(PCIDevice *pdev, unsigned short nentries, MemoryRegion *bar, unsigned bar_nr, unsigned bar_size);
>>> --
>>> 1.7.10
>>>
>>> ps. double '-' and git version is an end-of-patch scissor as I read somewhere, cannot recall where exactly :)

On 21/06/12 20:56, Jan Kiszka wrote:
On 2012-06-21 12:50,
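For reference, the register layout that msi_set_message() writes can be modeled without QEMU. The helpers below are hypothetical stand-ins for QEMU's pci_set_word/pci_set_quad (PCI config space is little-endian), writing a 64-bit MSI message address and data word at the standard offsets within an MSI capability:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Little-endian stores into a config-space byte buffer, modeled after
 * QEMU's pci_set_word/long/quad helpers (stand-alone versions). */
static void cfg_set_word(uint8_t *p, uint16_t v)
{
    p[0] = v & 0xff;
    p[1] = v >> 8;
}

static void cfg_set_long(uint8_t *p, uint32_t v)
{
    cfg_set_word(p, (uint16_t)v);
    cfg_set_word(p + 2, (uint16_t)(v >> 16));
}

static void cfg_set_quad(uint8_t *p, uint64_t v)
{
    cfg_set_long(p, (uint32_t)v);
    cfg_set_long(p + 4, (uint32_t)(v >> 32));
}

/* Sketch of the 64-bit case of msi_set_message(): cap points at the MSI
 * capability header; address lo/hi live at offsets 4..11 and the data
 * word at offset 12 (the PCI_MSI_ADDRESS_LO / PCI_MSI_DATA_64 layout). */
static void msi_write_message(uint8_t *cap, uint64_t addr, uint16_t data)
{
    cfg_set_quad(cap + 4, addr);   /* message address, low then high */
    cfg_set_word(cap + 12, data);  /* message data */
}
```

On x86 the address would be in the 0xFEExxxxx LAPIC range; on POWER it is whatever bus address the host controller (or RTAS) assigned, which is exactly why a side-channel setter is needed there.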
Re: [PATCH 0/8] KVM: Optimize MMU notifier's THP page invalidation -v4
On Mon, Jul 02, 2012 at 05:52:39PM +0900, Takuya Yoshikawa wrote:
> v3->v4: Resolved trace_kvm_age_page() issue -- patch 6, 7
> v2->v3: Fixed intersection calculations. -- patch 3, 8
>
> Takuya
>
> Takuya Yoshikawa (8):
>   KVM: MMU: Use __gfn_to_rmap() to clean up kvm_handle_hva()
>   KVM: Introduce hva_to_gfn_memslot() for kvm_handle_hva()
>   KVM: MMU: Make kvm_handle_hva() handle range of addresses
>   KVM: Introduce kvm_unmap_hva_range() for kvm_mmu_notifier_invalidate_range_start()
>   KVM: Separate rmap_pde from kvm_lpage_info->write_count
>   KVM: MMU: Add memslot parameter to hva handlers
>   KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
>   KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
>
>  arch/powerpc/include/asm/kvm_host.h |   2 +
>  arch/powerpc/kvm/book3s_64_mmu_hv.c |  47 ---
>  arch/x86/include/asm/kvm_host.h     |   3 +-
>  arch/x86/kvm/mmu.c                  | 107 +++
>  arch/x86/kvm/x86.c                  |  11
>  include/linux/kvm_host.h            |   8 +++
>  virt/kvm/kvm_main.c                 |   3 +-
>  7 files changed, 131 insertions(+), 50 deletions(-)
>
> From v2: The new test result was impressively good, see below, and THP page invalidation was more than 5 times faster on my x86 machine.
>
> Before:
> ...
> 19.852 us | __mmu_notifier_invalidate_range_start();
> 28.033 us | __mmu_notifier_invalidate_range_start();
> 19.066 us | __mmu_notifier_invalidate_range_start();
> 44.715 us | __mmu_notifier_invalidate_range_start();
> 31.613 us | __mmu_notifier_invalidate_range_start();
> 20.659 us | __mmu_notifier_invalidate_range_start();
> 19.979 us | __mmu_notifier_invalidate_range_start();
> 20.416 us | __mmu_notifier_invalidate_range_start();
> 20.632 us | __mmu_notifier_invalidate_range_start();
> 22.316 us | __mmu_notifier_invalidate_range_start();
> ...
>
> After:
> ...
 4.089 us | __mmu_notifier_invalidate_range_start();
 4.096 us | __mmu_notifier_invalidate_range_start();
 3.560 us | __mmu_notifier_invalidate_range_start();
 3.376 us | __mmu_notifier_invalidate_range_start();
 3.772 us | __mmu_notifier_invalidate_range_start();
 3.353 us | __mmu_notifier_invalidate_range_start();
 3.332 us | __mmu_notifier_invalidate_range_start();
 3.332 us | __mmu_notifier_invalidate_range_start();
 3.332 us | __mmu_notifier_invalidate_range_start();
 3.337 us | __mmu_notifier_invalidate_range_start();
 ...

Applied, thanks.
Re: [PATCH] msi/msix: added API to set MSI message address and data
On 19/07/12 01:23, Michael S. Tsirkin wrote:
On Wed, Jul 18, 2012 at 11:17:12PM +1000, Alexey Kardashevskiy wrote:
On 18/07/12 22:43, Michael S. Tsirkin wrote:
On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:

Added (msi|msix)_set_message() functions.

Currently msi_notify()/msix_notify() write to these vectors to signal the guest about an interrupt, so the correct values have to be written there by the guest or QEMU. For example, a POWER guest never initializes MSI/MSI-X vectors; instead it uses RTAS hypercalls. So in order to support MSI-X for virtio-pci on POWER we have to initialize the MSI/MSI-X message from QEMU.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

So guests do enable MSI through config space, but do not fill in vectors?

Yes. msix_capability_init() calls arch_setup_msi_irqs(), which does everything it needs to do (i.e. calls the hypervisor) before msix_capability_init() writes PCI_MSIX_FLAGS_ENABLE to the PCI_MSIX_FLAGS register. These vectors are the PCI bus addresses; the way they are set is specific to a PCI host controller. I do not see why the current scheme is a bug.

It won't work with any real PCI device, will it? Real PCI devices expect vectors to be written into their memory.

Yes. And the hypervisor does this. On POWER (at least book3s, i.e. server powerpc), the whole config space kitchen is hidden behind RTAS (a kind of BIOS). For the guest, this RTAS is implemented in the hypervisor; for the host, in the system firmware. So powerpc Linux does not have to have PHB drivers. Kinda cool.

A usual powerpc server runs without a host Linux at all; it runs a hypervisor called pHyp. And every guest knows that it is a guest; there is no full machine emulation, it is para-virtualization. In POWER KVM, we replace that pHyp with the host Linux, and now QEMU plays the hypervisor role. Some day we will move the hypervisor to the host kernel completely (?), but for now it is in QEMU.

Very strange. Are you sure it's not just a guest bug?
How does it work for other PCI devices?

Did not get the question. It works the same for every PCI device under a POWER guest.

I mean for real PCI devices. Can't we just fix guest drivers to program the vectors properly? Also pls address the comment below.

Comment below. Thanks!

---
 hw/msi.c  | 13 +
 hw/msi.h  |  1 +
 hw/msix.c |  9 +
 hw/msix.h |  2 ++
 4 files changed, 25 insertions(+)

diff --git a/hw/msi.c b/hw/msi.c
index 5233204..cc6102f 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit)
     return dev->msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32);
 }

+void msi_set_message(PCIDevice *dev, MSIMessage msg)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
+}
+

Please add documentation. Something like:

/*
 * Special API for POWER to configure the vectors through
 * a side channel. Should never be used by devices.
 */

It is useful for any para-virtualized environment, I believe, is it not? For s390 as well, of course, if it supports PCI, which I am not sure it does though :)

I expect the normal guest to program the address into the MSI register using config accesses, the same way that it enables MSI/MSI-X. Why POWER does it differently I have not yet figured out, but I hope this weirdness is not widespread. In para-virt I would expect the guest not to touch config space at all. At least it should use one interface rather than two, but this is how it is.
 bool msi_enabled(const PCIDevice *dev)
 {
     return msi_present(dev) &&

diff --git a/hw/msi.h b/hw/msi.h
index 75747ab..6ec1f99 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -31,6 +31,7 @@ struct MSIMessage {

 extern bool msi_supported;

+void msi_set_message(PCIDevice *dev, MSIMessage msg);
 bool msi_enabled(const PCIDevice *dev);
 int msi_init(struct PCIDevice *dev, uint8_t offset,
              unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);

diff --git a/hw/msix.c b/hw/msix.c
index ded3c55..5f7d6d3 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
     return msg;
 }

+void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] =
RE: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation
-----Original Message-----
From: Wood Scott-B07421
Sent: Thursday, July 19, 2012 7:56 AM
To: Bhushan Bharat-R65777
Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; ag...@suse.de; Bhushan Bharat-R65777
Subject: Re: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation

On 07/18/2012 04:39 AM, Bharat Bhushan wrote:

This patch adds watchdog emulation in KVM. The watchdog emulation is enabled by the KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl. A kernel timer is used for the watchdog emulation and emulates the h/w watchdog state machine. On watchdog timer expiry, it exits to QEMU if TCR.WRC is non-zero. QEMU can reset/shutdown etc. depending upon how it is configured.

Signed-off-by: Liu Yu yu@freescale.com
Signed-off-by: Scott Wood scottw...@freescale.com
Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com

Please put a note before your signoff stating that it's been modified since the previous signoffs, such as:

[bharat.bhus...@freescale.com: reworked patch]

@@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
 	u8 osi_needed;
 	u8 osi_enabled;
 	u8 papr_enabled;
+	u8 watchdog_enable;

s/enable;/enabled;/

+static void arm_next_watchdog(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr_jiffies;
+
+	nr_jiffies = watchdog_next_timeout(vcpu);
+	spin_lock(&vcpu->arch.wdt_lock);

Do watchdog_next_timeout() from within the lock.

Why do you think so? Does the next timeout calculation need to be protected?

+void kvmppc_watchdog_func(unsigned long data)
+{
+	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
+	u32 tsr, new_tsr;
+	int final;
+
+	do {
+		new_tsr = tsr = vcpu->arch.tsr;
+		final = 0;
+
+		/* Time out event */
+		if (tsr & TSR_ENW) {
+			if (tsr & TSR_WIS) {
+				final = 1;
+			} else {
+				new_tsr = tsr | TSR_WIS;
+			}

Unnecessary braces on the inner if/else.
+		} else {
+			new_tsr = tsr | TSR_ENW;
+		}
+	} while (cmpxchg(&vcpu->arch.tsr, tsr, new_tsr) != tsr);
+
+	if (new_tsr & TSR_WIS) {
+		smp_wmb();
+		kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
+		kvm_vcpu_kick(vcpu);
+	}
+
+	/*
+	 * If this is the final watchdog expiry and some action is required
+	 * then exit to userspace.
+	 */
+	if (final && (vcpu->arch.tcr & TCR_WRC_MASK)) {
+		smp_wmb();
+		kvm_make_request(KVM_REQ_WATCHDOG, vcpu);
+		kvm_vcpu_kick(vcpu);
+	}
+
+	/*
+	 * Stop running the watchdog timer after the final expiration to
+	 * prevent the host from being flooded with timers if the
+	 * guest sets a short period.
+	 * Timers will resume when TSR/TCR is updated next time.
+	 */
+	if (!final)
+		arm_next_watchdog(vcpu);
+}

 static void update_timer_ints(struct kvm_vcpu *vcpu)

Blank line between functions.

 static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu)

@@ -516,6 +627,13 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 		goto out;
 	}

+	if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) &&
+	    vcpu->arch.watchdog_enable) {

Check .watchdog_enabled when you make the request, not here.

+		kvm_run->exit_reason = KVM_EXIT_WATCHDOG;
+		ret = 0;
+		goto out;
+	}

I think we should check that ENW/WIS are still set. Now that we don't use TSR[WRS], this is the only way QEMU can preemptively clear a pending final expiration after leaving debug halt. If QEMU doesn't do this, and KVM comes back with a watchdog exit, QEMU won't know if it's a new legitimate one, or if it expired during the debug halt. This would be less ugly than QEMU doing a reset-TSR[WIS,ENW]-and-ignore on the first watchdog exit after a debug halt, and letting the expiration happen again if the guest is really stuck.

Ahh. What about clearing the KVM_REQ_WATCHDOG request when QEMU clears ENW|WIS (or whenever KVM or QEMU clears ENW|WIS)?
Thanks
-Bharat

 	if (sregs->u.e.update_special & KVM_SREGS_E_UPDATE_TSR) {
+		u32 old_tsr = vcpu->arch.tsr;
+		vcpu->arch.tsr = sregs->u.e.tsr;
+
+		if ((old_tsr ^ vcpu->arch.tsr) &
+		    (TSR_ENW | TSR_WIS | TCR_WRC_MASK))
+			arm_next_watchdog(vcpu);

Do we still do anything with TSR[WRS] at all?

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 87f4dc8..3c50b81 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -38,9 +38,11 @@
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
 {
-	return !(v->arch.shared->msr & MSR_WE) ||
-	       !!(v->arch.pending_exceptions) ||
-	       v->requests;
+	bool ret = !(v->arch.shared->msr & MSR_WE) ||